I. INTRODUCTION


A. CYCLIC SPECTRUM ANALYSIS

Cyclic spectrum analysis is used to investigate
cyclostationary properties of signals and systems. This
technique generalizes conventional spectral analysis to
include periodically time-variant signals and systems. Cyclic
spectrum analysis is well suited for signal detection,
modulation recognition, signal parameter estimation, and the
design of communications systems. Applications to spaceborne
systems are possible if the integrated circuit (IC) is
radiation hardened. [Ref. 1]

This method of spectral analysis is concerned with signals
that contain more subtle types of periodicity that do not give
rise to spectral lines, but which can be converted into
spectral lines with a nonlinear time-invariant transformation
of the signal. The spectral correlation density function for
a discrete real-valued signal x(n) is defined as:

\[ S_x^{\alpha}(f) = \sum_{k=-\infty}^{\infty} R_x^{\alpha}(k)\, e^{-j 2\pi f k} \]

which is the Discrete Fourier Transform of the cyclic
correlation function:

\[ R_x^{\alpha}(k) = \lim_{N \to \infty} \frac{1}{2N+1} \sum_{n=-N}^{N} \left[ x(n+k)\, e^{-j\pi\alpha(n+k)} \right] \left[ x(n)\, e^{j\pi\alpha n} \right]^{*} \]

where α is the cyclic frequency.
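As a rough illustration of the cyclic correlation function, the sketch below forms a finite-length estimate by averaging over the available samples instead of taking the limit. The function name `cyclic_autocorr`, the normalization by the number of terms, the restriction to lags k >= 0, and the test signal are assumptions made for this example and are not taken from the thesis.

```python
import cmath

def cyclic_autocorr(x, alpha, k):
    """Finite-length estimate of the cyclic correlation at lag k >= 0.

    Averages x(n+k) * conj(x(n)) * exp(-j*pi*alpha*(2n + k)) over every n
    for which both samples exist (assumed normalization).
    """
    terms = [
        x[n + k] * complex(x[n]).conjugate()
        * cmath.exp(-1j * cmath.pi * alpha * (2 * n + k))
        for n in range(len(x) - k)
    ]
    return sum(terms) / len(terms)

# Hypothetical test signal with period 4; its square has period 2.
x = [1, 0, -1, 0] * 16
r_zero = cyclic_autocorr(x, alpha=0.0, k=0)      # average power
r_half = cyclic_autocorr(x, alpha=0.5, k=0)      # cyclic feature present
r_quarter = cyclic_autocorr(x, alpha=0.25, k=0)  # no cyclic feature here
```

For this signal the estimate is nonzero at α = 0 and α = 0.5 and vanishes at α = 0.25, which is the kind of hidden periodicity the cyclic spectrum is meant to expose.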

A particularly useful application of cyclic spectral
analysis is the investigation of modulation techniques,
especially spread spectrum. Figure 1 [Ref. 2:p. 28] is a plot
of the cyclic spectrum of a binary phase-shift keyed (BPSK)
signal. The magnitude of the cyclic spectrum is plotted as
the height above the α-f plane, where f is the spectral
frequency and α is the cyclic frequency. The power spectral
density function is represented on the α = 0 line.

The computational complexity of cyclic spectrum analysis,
which far exceeds that of conventional spectrum analysis,
limits its use as a signal and systems analysis tool. The
operations involved in the algorithms are common to most
signal processing algorithms: Fourier transformations,
convolution, and product modulations [Ref. 1]. In this
application, the number of operations required is too
great for general purpose computers. Computing the cyclic
spectrum algorithms can best be accomplished by using
Application Specific Integrated Circuit (ASIC) design in Very
Large Scale Integration (VLSI).





Figure 1. Cyclic Spectrum for a BPSK Signal 


B. GENESIL SILICON COMPILER (GSC) 

One method of ASIC design is silicon compilation. A
silicon compiler is an automatic translation tool that
converts a behavioral description to a mask level description.
In other words, a silicon compiler allows an engineer who is
not an expert in IC design to design an IC. Because of low
design costs, silicon compilation is also ideal for implementing
an IC design that will not have a large production quantity.
A major problem with silicon compilers is low component
density, which translates to large silicon area and slower
clock speed. To alleviate this problem, new versions of
silicon compilers are providing more capability in automatic
floorplanning and routing.


The GSC provides the user with the capability of designing
VLSI circuits from high level system description to
manufacture tapeout by producing the IC circuits from
architectural descriptions. Huber [Ref. 3:p. 88] states that
there are two significant limitations to the GSC Version 7.1
which he used: component density and vertical feedthrough.
The most significant is the inability to achieve high
component density. In Huber's parallel multiplier design
[Ref. 3:pp. 86-88], an attempt was made to establish vertical
feedthrough between adjacent multiplier levels with no
success. Since that time GSC Version 8.0 and the
Logic-Compiler (AutoLogic) have been installed. This software
offers more capabilities to overcome these limitations. The
Logic-Compiler performs synthesis and optimization on an input
netlist representation of a design to produce an output design
optimized for area and performance. Appendix A gives a more
complete description of Genesil 8.0 and the Logic-Compiler.


C. THESIS GOALS

The motivation for this thesis is to implement a cyclic
spectrum analyzer (CSA) using ASIC VLSI design. The
fundamental building blocks for the CSA are the floating point
multiplier, adder, and the Fast Fourier Transform (FFT)
butterfly. The primary goal of this thesis is to design these
processing elements: a floating point multiplier, a floating
point adder, and a rate-1/4 radix-4 complex floating point FFT
butterfly using a 20-bit word that can operate at a minimum
rate of 40 MHz. To achieve this goal, investigation of high
speed arithmetic and the capabilities of Genesil and the
Logic-Compiler is required. Chapter 2 presents an in-depth
investigation of high speed arithmetic design.


II. HIGH SPEED DIGITAL ARITHMETIC 


A. NUMBER SYSTEMS 
1. Introduction 
Representation of numbers within a digital system is
accomplished with a group of bits. The number of bits used to
represent a number determines the total number of
representable values. For each additional bit added to the
representation, the number of representable values doubles.
For example, there are 2^N representable values in an N-bit
binary number. What these values represent depends on the
number system chosen by the designer. Integer representations
include ones' complement, two's complement, and excess code.
Rational number representations utilize the integer number
systems to implement fixed point and floating point
representations of fractional numbers.
2. Integer Number Systems 
a. Unsigned 
The simplest integer system is the unsigned system.
The binary numbers simply represent unsigned numbers. The range
of representable numbers is from 0 to 2^N - 1, where N is the
number of bits in the representation. Each bit position k has
associated with it a value of 2^k, and the value represented by
the collection of bits is described as:

\[ V_{\text{UNSIGNED}} = \sum_{i=0}^{N-1} b_i \times 2^{i} \]

where b_i is the one or zero in position i. Unsigned numbers
are easy to manipulate but they can only represent nonnegative
integers. [Ref. 4:pp. 31-32]
b. Two's Complement 

The most common method to represent signed numbers
is the two's complement number system. The range of
representable numbers in an N-bit word is from -2^(N-1) to
2^(N-1) - 1. Negative numbers are represented by subtracting
the unsigned value of the number from 2^N. The value for any
N-bit two's complement number is given by:

\[ V_{\text{TWO'S COMPLEMENT}} = -b_{N-1} \times 2^{N-1} + \sum_{i=0}^{N-2} b_i \times 2^{i} \]

For example, let N = 4 and the number to represent be -7 =
-0111 in binary. The number is represented in two's
complement as 2^4 - 7, which is 10000 - 0111 = 1001 in binary.
[Ref. 5:pp. 190-193]
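The encoding rule and the value formula for two's complement can be checked with a short sketch. The helper names `to_twos_complement` and `from_twos_complement` are illustrative only; bit strings are written most significant bit first.

```python
def to_twos_complement(value, n_bits):
    """Return the n-bit two's complement pattern of value as a bit string."""
    if not -(1 << (n_bits - 1)) <= value < (1 << (n_bits - 1)):
        raise ValueError("value out of range for the word length")
    # Masking a negative value to n bits yields 2**n - |value|.
    return format(value & ((1 << n_bits) - 1), f"0{n_bits}b")

def from_twos_complement(bits):
    """Evaluate -b[N-1] * 2**(N-1) + sum(b[i] * 2**i) for a bit string."""
    n_bits = len(bits)
    msb = int(bits[0])
    rest = int(bits[1:], 2) if n_bits > 1 else 0
    return -msb * (1 << (n_bits - 1)) + rest

pattern = to_twos_complement(-7, 4)      # the example from the text
value = from_twos_complement(pattern)
```

With N = 4 the pattern for -7 is 1001, and evaluating that pattern with the weighted sum gives -8 + 1 = -7.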

Although the most significant bit is not defined as
the sign bit, it still is considered as such. If the most
significant bit is set, the value will be negative. This is
because the most significant bit carries more weight than all
of the other bits added together.

The reason that two's complement is such a popular
system is its circular nature. This is illustrated in Figure
2 [Ref. 5]. The primary drawback to using the two's
complement number system is that it requires a relatively
complicated conversion from signed magnitude to two's
complement or vice versa. Two's complement multiplication
also requires more hardware than unsigned or magnitude
multiplication.

c. Ones' Complement

Another number representation, ones' complement,
requires a much simpler conversion procedure from signed
magnitude. The ones' complement conversion requires only that
each bit of the signed magnitude binary number be inverted for
negative numbers. The ones' complement representation of a
binary number N is formulated by N_ones = (2^n - 1) - N, where n
= the number of bits and N = the unsigned number to be
inverted. As shown in Figure 3 [Ref. 5:p. 194], ones'
complement also has a circular nature except that it has two
representations of zero: 0000 and 1111. Ones' complement
addition develops a special situation when a carry is
generated. Since there are two zeros in the number system,
the sum will be in error by one from the correct answer if a
carry is generated. This is corrected by the "end-around"
carry. The "end-around" carry adds one to the sum if a carry
is generated in the original addition. Using this number
system requires as much or more hardware to implement
arithmetic operations as the two's complement system.

Figure 2. Two's Complement Representation

Figure 3. Ones' Complement Representation
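The end-around carry can be illustrated with a minimal sketch that works on n-bit patterns held in Python integers. The function names are hypothetical and chosen only for this example.

```python
def ones_complement_encode(value, n_bits):
    """Encode value as an n-bit ones' complement pattern."""
    mask = (1 << n_bits) - 1
    # Negative numbers use (2**n - 1) - N, i.e. every bit of N inverted.
    return value if value >= 0 else mask - (-value)

def ones_complement_add(a, b, n_bits):
    """Add two n-bit ones' complement patterns with end-around carry."""
    mask = (1 << n_bits) - 1
    total = a + b
    if total > mask:                 # a carry out of the top bit occurred
        total = (total & mask) + 1   # feed the carry back into the sum
    return total & mask

five = ones_complement_encode(5, 4)          # 0101
minus_three = ones_complement_encode(-3, 4)  # 1100
result = ones_complement_add(five, minus_three, 4)
```

Here 0101 + 1100 produces 10001; dropping the carry leaves 0001, which is off by one, and adding the carry back in gives 0010, the correct value of 2.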
d. Excess Code 

Excess code number representation utilizes an
excess number that is added to the value of the number to be
represented. If V is the number to be represented and E the
excess, then the excess code number S is:

S = V + E.

For example, the 4-bit excess 8 code for 3 is 1011, and the
code for -3 is 0101. Zero is represented by 1000 in binary.
When the excess is 2^(N-1), as in this example, excess code
can be converted to two's complement by inverting
the most significant bit of the excess code number. The most
prevalent use of excess code is to store exponents in floating
point numbers.
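A short sketch of the excess code, assuming the excess is 2^(N-1) as in the excess 8 example. The function names are illustrative.

```python
def to_excess(value, n_bits):
    """Return the excess-2**(n_bits-1) code S = V + E as an integer pattern."""
    excess = 1 << (n_bits - 1)       # assumed excess, 8 for a 4-bit code
    return value + excess

def excess_to_twos_complement(code, n_bits):
    """Convert an excess code pattern by inverting its most significant bit."""
    return code ^ (1 << (n_bits - 1))

code_plus3 = to_excess(3, 4)     # 1011
code_minus3 = to_excess(-3, 4)   # 0101
code_zero = to_excess(0, 4)      # 1000
twos_minus3 = excess_to_twos_complement(code_minus3, 4)  # 1101
```

Inverting the top bit of 0101 gives 1101, which is the 4-bit two's complement pattern for -3.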
3. Rational Numbers 
a. Fixed Point 

Fixed point rational number representation is much
like integer representation except that the radix point is not
necessarily directly to the right of the least significant bit
of the number. The placement of the radix point is established
purely to satisfy the requirements of the user or designer for
fractional or integer numbers. If the information to be
represented contains fractional values, then assumption of a
radix point establishes a fixed point system that is so
adjusted that it can cover the necessary range [Ref. 4:p. 36].
Addition in fixed point systems is done exactly as for integer
operations. For multiplication, care must be taken to assure
that the radix point is in the correct place and that the
correct bits are preserved after an operation. The value of
a two's complement fixed point number is:

\[ V_{\text{FIXED POINT}} = -b_{N-1} \times 2^{N-1-p} + \sum_{i=0}^{N-2} b_i \times 2^{i-p} \]

where p is the position of the radix point, that is, the number
of bit positions to the left of the least significant bit where
the assumed radix point is found. If p = 0, then the fixed point
system is the same as the integer one. This enables the
designer to determine the smallest value required to meet the
needs of the system and select the number system accordingly.
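The fixed point value formula can be evaluated directly, as in the sketch below. The function name and the use of a most-significant-bit-first string are assumptions for illustration.

```python
def fixed_point_value(bits, p):
    """Value of a two's complement fixed point pattern.

    bits is a string, most significant bit first; p is the number of bit
    positions to the left of the least significant bit at which the radix
    point is assumed.
    """
    n = len(bits)
    value = -int(bits[0]) * 2.0 ** (n - 1 - p)
    for i in range(n - 1):
        # Bit i (counting from the least significant end) has weight 2**(i - p).
        value += int(bits[n - 1 - i]) * 2.0 ** (i - p)
    return value

a = fixed_point_value("0110", 2)   # 01.10
b = fixed_point_value("1110", 2)   # 11.10
c = fixed_point_value("1001", 0)   # p = 0 reduces to the integer case
```

With p = 2 the pattern 0110 reads as 1.5 and 1110 as -0.5; with p = 0 the pattern 1001 is the integer -7.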
b. Floating Point 

(1) Format. Many applications require the ability
to represent information of a much greater or smaller
magnitude than is possible with fixed point systems. The use of
scientific notation solves this problem in the decimal number
system. A similar system is used to represent large and small
numbers in digital arithmetic systems. This number system is
called the floating point number system. This type of number
system does not expand the quantity of representable values;
it modifies the way in which the values are interpreted.
[Ref. 4:p. 42]

To specify a floating point number, seven
different pieces of information are required: the base of the
system; the sign, magnitude, and base of the mantissa; and the
sign, magnitude, and base of the exponent. [Ref. 4:p. 42] In
most cases, the base of a digital number system, the base of
the mantissa, and the base of the exponent are all 2. A floating
point number, as described above, will have the following
format:

(Sign) Mantissa × Base^Exponent

The sign bit denotes the sign of the floating point number,
usually represented as a 0 for positive and a 1 for negative.
The mantissa is used to identify the significant bits of a
number value. The base denotes the radix of the system,
usually 2. This value is not stored in a digital system but
is part of the definition of the number system. The location
of the value of a floating point number on the real number
line is determined by the exponent.

(2) Mantissa. The number of bits in the mantissa
determines the accuracy of the floating point numbers
represented. The format of the mantissa usually includes a
"hidden" bit when representing normalized numbers. The use of
a "hidden" bit increases the number of representable mantissas
by a factor of 2. To compute the range of the number system,
the minimum and maximum allowable values for the mantissa must
be determined.

The minimum and maximum values of the mantissa
depend on the use of a "hidden" bit and the acceptance of
denormalized numbers. Normalized numbers are floating point
numbers that are forced to have a 1 in the most significant
bit position of the mantissa. Using a "hidden" bit
for the most significant bit is ideal since it will always be
1. In the IEEE standard [Ref. 6] for binary floating point
numbers and in most other systems, the radix point is located
to the right of this "hidden" bit. If the system allowed
denormalized numbers, the "hidden" bit could be 0, thus
allowing a greater range of representable numbers. For
example, a 4-bit normalized mantissa with a "hidden" bit has
a minimum value of 1.0000 and a maximum value of 1.1111. The
same system, except that it allows denormalized numbers, has
a minimum value of 0.0001 and a maximum value of 1.1111. It
is possible to use any integer number system for the mantissa,
including the systems discussed in paragraph 2. The most common
method is signed magnitude, which is the IEEE standard [Ref.
6] for floating point mantissas.

(3) Exponent. The exponent along with the radix of
the system determines the range of the floating point system.
The exponent also needs to have a sign to represent floating
point values less than the smallest representable value of the
mantissa. In a normalized base 2 floating point number
system, the smallest mantissa is one. To represent fractional
values in this system, the exponent must be negative. For
example, 0.5 is represented by a 1 in the mantissa and a -1 in
the exponent:

1 × 2^-1 = 0.5

Like the mantissa, any integer number system would be
sufficient for the number representation in the exponent, but
the most commonly used method is excess code. The IEEE
standard [Ref. 6] uses excess code.

Zero representation in a normalized system is
done in the exponent. Usually the smallest representable
value in the exponent is reserved to indicate a true zero
value. This must be done because the "hidden" bit is always
a one, which means the mantissa is always nonzero. In systems
which allow denormalized numbers, there is a zero in the
"hidden" bit when a denormalized number is represented,
usually denoted by all zeros in the exponent. In this case,
true zero is represented by the smallest value in the exponent
and all zeros in the mantissa.
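The pieces described above (sign bit, hidden bit, excess-coded exponent, and a reserved exponent for zero) can be combined in a small decoder. This is a sketch of a hypothetical normalized-only toy format with a 4-bit excess 8 exponent and a 4-bit fraction; it is not the 20-bit format of the thesis and not the IEEE format.

```python
def decode_float(sign, exp_code, frac, exp_bits=4, frac_bits=4):
    """Decode a toy normalized floating point number (hypothetical format)."""
    if exp_code == 0:
        # The smallest exponent code is reserved for true zero.
        return 0.0
    excess = 1 << (exp_bits - 1)                # assumed excess 2**(exp_bits-1)
    mantissa = 1.0 + frac / (1 << frac_bits)    # hidden bit supplies the leading 1
    value = mantissa * 2.0 ** (exp_code - excess)
    return -value if sign else value

half = decode_float(0, 7, 0b0000)         # 1.0000 x 2**-1
largest_mantissa = decode_float(0, 8, 0b1111)  # 1.1111 x 2**0
minus_three = decode_float(1, 9, 0b1000)  # -(1.1000 x 2**1)
zero = decode_float(0, 0, 0b0000)
```

The first call reproduces the 0.5 example: a mantissa of 1 and an exponent of -1, stored as the excess code 7.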


B. HIGH SPEED INTEGER ADDERS 
1. Introduction 
The addition function in an arithmetic computation
system is the most fundamental of the add, subtract, multiply,
and divide functions. All of these operations can be implemented
by some combination of the add function. The full adder cell
is the fundamental building block in the ripple-carry and
carry-save adders. The two other high speed adders to be
discussed, conditional sum and carry-lookahead, are
synchronous and do not require the use of the full adder cell.
2. The Full Adder 

The function of a full adder is to add two bits and
the carry from the next less significant bit to produce a sum
and a carry out to the next more significant bit. A functional
diagram is shown in Figure 4(a) [Ref. 4:p. 71]. The truth
table for the function is shown in Figure 4(b). As shown, the
three input bits, A_i, B_i, and C_in, are summed to produce two
bits: F_i, the sum, which has the same significance as the input
bits, and C_out, which is one bit more significant. Figure 4(c)
shows the Karnaugh maps for C_out and F_i with the resulting sum
of products Boolean equations. These equations are implemented
with random logic as shown in Figure 4(d). [Ref. 7:pp. 70-71]
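The behavior of the full adder cell can be checked against its truth table with a few lines of code. The sketch assumes the usual sum of products form for the carry (the majority of the three inputs) and the three-input exclusive OR for the sum; the function name is illustrative.

```python
from itertools import product

def full_adder(a, b, c_in):
    """Return (F, C_out) for one full adder cell."""
    f = a ^ b ^ c_in                           # sum bit, same significance
    c_out = (a & b) | (b & c_in) | (a & c_in)  # carry, one bit more significant
    return f, c_out

# Enumerate the eight rows of the truth table.
truth_table = {
    (a, b, c): full_adder(a, b, c) for a, b, c in product((0, 1), repeat=3)
}
```

Every row satisfies 2 × C_out + F = A + B + C_in, which is the arithmetic meaning of the cell.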
Figure 4. Full Adder Cell Design. Panel (c) gives the Boolean equations:

\[ C_{out} = A_i B_i + B_i C_{in} + A_i C_{in} \]
\[ F_i = A_i \oplus B_i \oplus C_{in} = \bar{A}_i \bar{B}_i C_{in} + \bar{A}_i B_i \bar{C}_{in} + A_i \bar{B}_i \bar{C}_{in} + A_i B_i C_{in} \]

3. Two Operand Adders
a. Ripple-Carry Adder
The ripple-carry adder is just a group of full
adders cascaded to the width of the desired word length. The
C_out of one bit is wired to the C_in of the next significant bit.
This is not a high speed adder design because it requires two
gate delays for every bit in the width of the numbers to be
added. For example, in the 8-bit adder shown in Figure 5, the
delay to the final C_out is 16 gate delays. This adder can be
made synchronous by the insertion of appropriately placed
flip-flops.
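A behavioral sketch of the ripple-carry adder follows. Bits are held in lists with the least significant bit first, and the helper names are assumptions for this example. The loop mirrors the hardware in that each stage must wait for the carry from the stage below it.

```python
def to_bits(value, n):
    """Split an unsigned integer into n bits, least significant first."""
    return [(value >> i) & 1 for i in range(n)]

def from_bits(bits):
    """Reassemble an integer from bits, least significant first."""
    return sum(b << i for i, b in enumerate(bits))

def ripple_carry_add(a_bits, b_bits, c_in=0):
    """Cascade full adder cells; the carry ripples from stage to stage."""
    carry = c_in
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)
        carry = (a & b) | (b & carry) | (a & carry)
    return sum_bits, carry

sum_bits, c_out = ripple_carry_add(to_bits(200, 8), to_bits(100, 8))
total = from_bits(sum_bits)
```

Adding 200 and 100 in 8 bits leaves 44 in the sum bits with a carry out of 1, since 300 = 256 + 44.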
b. Conditional Sum Adder 
In the case of the ripple-carry adder, the carry must
propagate through the length of the word. This is an
unacceptable delay for high speed arithmetic operations. The
conditional sum adder overcomes this problem by generating
distant carries and using these carries to select the true
sum outputs from two simultaneously generated conditional sums
under different carry input conditions. [Ref. 7:p. 78]
Conditional sum adders offer significant speed gains over
ripple-carry adders by utilizing logic gates and multiplexers
with small fan-in and fan-out. The delay is proportional to
log2 N instead of N as in the ripple-carry case. The major
disadvantage is a large increase in area required for
hardware.

Figure 5. 8-bit Ripple Carry Adder

A 7-bit two-operand adder using the conditional sum
algorithm is illustrated in Figure 6 [Ref. 7:p. 79]. For the
example, the inputs are A = 1101101 and B = 0110110, with no
external carry in. The S_i's are the conditional sums within the
adder. Subscripts indicate bit position and superscripts
indicate the assumption of a carry or no carry into the lowest
order bit position of a section. There are ⌈n/k⌉ sections in
S^0(k) or S^1(k) for an n-bit addition. The number of steps (t)
required is given by:

\[ t = \lceil \log_2 n \rceil \]

where n is the number of bits in the adder.

The adder represented in Figure 6 has n = 7.
Therefore, t = ⌈log2 7⌉ = 3 steps are required to complete the
addition. Step one has a carry and a no carry into each bit
position, so there are 7 sections. The section size doubles
for each successive step with a carry and no carry into each
least significant bit position of each section. The arrows in
Figure 6 show, for the example inputs, how the carries are
generated between sections and how they are eventually used to
select the true sum and carry outputs. [Ref. 7:pp. 78-80]

Figure 6. 7-bit Conditional Sum Adder Algorithm
Note: The arrows show the actual carries generated between
sections. The initial carry to the rightmost section is always
assumed zero.

The conditional sum adder can be implemented with
2-input multiplexers, AND gates, OR gates, and inverters. The
initial conditional sums and carries can be generated in
parallel using the random logic gates indicated above. If the
conditional sum and carry of the ith bit position with no carry
in are denoted by S_i^0 and C_{i+1}^0 respectively, then the
Boolean equations for these are:

\[ S_i^{0} = A_i \oplus B_i \]
\[ C_{i+1}^{0} = A_i B_i \]

Similarly, for the conditional sum and carry with a carry in
(S_i^1 and C_{i+1}^1):

\[ S_i^{1} = \overline{A_i \oplus B_i} \]
\[ C_{i+1}^{1} = A_i + B_i \]

Each input bit position must generate these sums and carries.
The carries will eventually be used to select the final sum
and carry out. The Conditional Cell (CC) in Figure 7 [Ref.
7:p. 81] generates these conditional sums and carries. Figure
7 also illustrates the hardware required to implement the
7-bit adder described in Figure 6. The first stage of
multiplexers selects the 0th bit of the final sum. The second
level selects the 1st and 2nd bits and the final multiplexer
outputs the 3rd through the 6th bits and the carry out of the
final sum.

Figure 7. 7-bit Conditional Sum Adder
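The selection process can be sketched in software. The code below is a behavioral model only, with assumed names and data layout: every bit position first forms a sum and carry for both carry-in assumptions, and adjacent sections are then merged pairwise, with the carry of the lower section selecting which result of the upper section survives (the role played by the multiplexers in hardware).

```python
def conditional_sum_add(a, b, n):
    """Add two n-bit unsigned integers; returns (sum, carry_out).

    A section is (sum0, carry0, sum1, carry1, width): the results for an
    assumed carry-in of 0 and of 1. The external carry-in is taken as zero.
    """
    sections = []
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        sections.append((ai ^ bi, ai & bi, (ai ^ bi) ^ 1, ai | bi, 1))
    while len(sections) > 1:
        merged = []
        for j in range(0, len(sections), 2):
            if j + 1 == len(sections):       # odd section left over this round
                merged.append(sections[j])
                continue
            lo, hi = sections[j], sections[j + 1]
            combined = []
            for lo_sum, lo_carry in ((lo[0], lo[1]), (lo[2], lo[3])):
                # The low section's carry selects the high section's result.
                hi_sum, hi_carry = (hi[2], hi[3]) if lo_carry else (hi[0], hi[1])
                combined += [lo_sum | (hi_sum << lo[4]), hi_carry]
            merged.append((*combined, lo[4] + hi[4]))
        sections = merged
    return sections[0][0], sections[0][1]

# The example inputs from the text: A = 1101101, B = 0110110.
total, carry_out = conditional_sum_add(0b1101101, 0b0110110, 7)
```

For the example inputs the 7-bit sum is 0100011 with a carry out of 1. The pairwise merging used here is one convenient way to organize the selection and may group sections differently from Figure 6.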
c. Carry-Lookahead Adder 

Another adder that overcomes the carry propagate
problem is the carry-lookahead adder. Carry ripple is
eliminated by using additional random logic to simultaneously
generate the carries entering all of the bit positions. This
results in a constant add time regardless of the length of the
adder.

Let A = A_{n-1} ... A_1 A_0 and B = B_{n-1} ... B_1 B_0 be the
inputs to an n-bit carry-lookahead adder. S_i and C_i are the sum
and carry outputs of the ith bit position of the adder. Two
functions must be generated for each bit position to implement
the carry-lookahead algorithm: carry generate (G_i) and carry
propagate (P_i). The Boolean equations are:

\[ G_i = A_i B_i \]
\[ P_i = A_i \oplus B_i \]

The ith carry generate function produces a binary 1 if a carry
is generated at the ith bit position independent of the less
significant sums and carries. The ith carry propagate
function produces a 1 if a carry is generated by a carry in
from the less significant bits. Although the obvious
implementation for generating the P_i's and G_i's would be to use
AND gates and exclusive OR gates, a NAND gate implementation
is possible and probably more economical in area usage.
Figure 8 illustrates the NAND gate implementation of an n-bit
wide carry generate and carry propagate unit for a
carry-lookahead adder. The following relations result after
substituting P_i and G_i into the sum and carry equations for a
full adder cell:

\[ S_i = (A_i \oplus B_i) \oplus C_{i-1} = P_i \oplus C_{i-1} \]
\[ C_i = A_i B_i + (A_i \oplus B_i) C_{i-1} = G_i + P_i C_{i-1} \]

Figure 8. n-bit Carry Generate/Propagate Unit

These equations show that the sum and carry of every bit
position are dependent only on the simultaneously generated
P_i's and G_i's and the carry in from the next less significant
bit position. To make this adder truly parallel and high speed,
these carries must be generated in parallel also. To
accomplish this, the equation for C_i can be used recursively as
follows:

\[ C_0 = G_0 + C_{-1} P_0 \]
\[ C_1 = G_1 + C_0 P_1 = G_1 + G_0 P_1 + C_{-1} P_0 P_1 \]
\[ \vdots \]
\[ C_{n-1} = G_{n-1} + G_{n-2} P_{n-1} + \cdots + C_{-1} P_0 P_1 \cdots P_{n-1} \]

where C_{-1} is the external carry in of the adder. These
equations can be realized with random logic. Figure 9 [Ref.
7:p. 86] is the logic circuit diagram of a 4-bit carry
lookahead unit. Obviously, the size of the carry lookahead
unit is limited by the fan-in of the random logic being used.

The final sum is generated with an array of XOR
gates as shown in Figure 10 [Ref. 7:p. 85], which is called the
summation unit. Figure 11 [Ref. 7:p. 89] illustrates how the
carry generate/propagate unit, the carry lookahead unit, and
the summation unit are combined together to form an 8-bit
carry lookahead adder.
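The three units (generate/propagate, carry lookahead, and summation) can be modeled as follows. This is a behavioral sketch with assumed names; each carry is computed from the expanded sum of products in the G's, P's, and the external carry in, never from another computed carry, which is what allows the hardware to form all carries at once.

```python
def carry_lookahead_add(a, b, n, c_in=0):
    """Add two n-bit unsigned integers; returns (sum, carry_out)."""
    # Carry generate/propagate unit.
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(n)]
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(n)]
    # Carry lookahead unit: expanded form of C_i = G_i + P_i C_(i-1).
    carries = []
    for i in range(n):
        c = g[i]
        prod = p[i]
        for j in range(i - 1, -1, -1):
            c |= g[j] & prod          # term G_j P_(j+1) ... P_i
            prod &= p[j]
        c |= c_in & prod              # term C_in P_0 ... P_i
        carries.append(c)
    # Summation unit: S_i = P_i XOR (carry into position i).
    total = 0
    for i in range(n):
        carry_into_i = c_in if i == 0 else carries[i - 1]
        total |= (p[i] ^ carry_into_i) << i
    return total, carries[n - 1]

total, c_out = carry_lookahead_add(200, 100, 8)
```

The growing number of product terms in the inner loop is the software counterpart of the fan-in limitation noted in the text.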

Fan-in limitations of the random logic used to
build a carry lookahead unit are a severe constraint that must
be solved. It can be solved by using the block carry
lookahead unit. This unit generates a block propagate (P*) if
a carry into the block would force a carry out of the block.
The block also generates a block generate (G*) if there is a
carry that originated within the block. [Ref. 7]
Figure 12 [Ref. 7:p. 87] is the logic circuit diagram of a 4-bit
block carry lookahead unit. The Boolean equations are:

\[ P^{*} = P_0 P_1 P_2 P_3 \]
\[ G^{*} = G_3 + G_2 P_3 + G_1 P_2 P_3 + G_0 P_1 P_2 P_3 \]

These blocks can be combined together to create a carry
lookahead adder of any size. Figure 13 [Ref. 7] illustrates a
32-bit carry lookahead adder using 4-bit and 8-bit block carry
lookahead units.

Figure 9. 4-bit Carry Lookahead Unit

Figure 10. Summation Unit

Figure 11. 8-bit Carry Lookahead Adder

Figure 12. 4-bit Block Carry Lookahead Unit


4. Multioperand Adders 
a. Introduction 

If there is a requirement to add more than two
numbers together, such as summing partial products in a
multiplier, then the number of two operand adders must
increase. For N inputs there must be N-1 two operand adders.
For a large number N, the adder will be intolerably slow and
large. The solution is to build multioperand adders from
random logic. The carry-save adder is one example of a
multioperand adder.

Figure 13. 32-bit Carry Lookahead Adder


b. Carry-Save Adder 

The carry-save adder is constructed from full adder
cells like the ripple-carry adder. The only difference is that
in the carry-save adder the carry out from each cell is not
propagated to the carry in of the next significant cell. This
carry out is saved for the next level of adders. This leaves
three inputs into the adder cell of equal precedence. The cell
produces one output of the same significance and one output of
one bit greater significance. Utilizing the adder cell in
this way is called row reduction. The full adder is a 3-to-2
row reduction unit. Figure 14 [Ref. 6:p. 101] illustrates
how carry-save adders (CSA) can be configured to produce row
reduction units with various numbers of inputs. Since the
carry-save adder does not complete the carry propagation,
a two operand adder must be used in the final
stage of adders to complete the sum.

Figure 14. Carry Save Adder Trees
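Row reduction with carry-save adders can be sketched on whole words, treating each bit position as an independent full adder cell. The function names and the simple reduction order are assumptions; real adder trees such as those in Figure 14 arrange the levels to minimize depth.

```python
def carry_save_add(x, y, z):
    """3-to-2 row reduction: returns (sum_word, carry_word)."""
    sum_word = x ^ y ^ z                          # per-bit sum, no propagation
    carry_word = ((x & y) | (y & z) | (x & z)) << 1  # saved carries, one bit up
    return sum_word, carry_word

def add_operands(operands):
    """Reduce any number of rows to two, then finish with a normal add."""
    rows = list(operands)
    while len(rows) > 2:
        s, c = carry_save_add(rows[0], rows[1], rows[2])
        rows = rows[3:] + [s, c]
    return sum(rows)   # stands in for the final two operand adder

s, c = carry_save_add(5, 6, 7)
total = add_operands([13, 7, 22, 5, 9])
```

For 5, 6, and 7 the reduction gives a sum word of 4 and a carry word of 14, which still add up to 18; only the last step needs a carry-propagating adder.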


C. HIGH SPEED INTEGER MULTIPLIERS 
1. Standard Multipliers 
a. Introduction 
An integer multiply in the binary number system is
much like that done in the decimal number system. This
procedure is illustrated in Figure 15 [Ref. 4:p. 83]. PP_0 -
PP_4 are called the partial product rows. Each partial product
is generated by a binary multiply: the AND function. The
final product is the sum of the properly aligned partial
product rows. This indicates the requirement of a binary
product module (AND) and a summing module to complete the
multiply function.

Standard multipliers are based on the add-shift
method for multiplication. Multipliers of this type include
the standard add-shift, multiple shift, multiple shift with
overlapped scanning, and the Booth multiplier. The Booth
multiplier, a variant of the add-shift method, uses string
recoding.

Figure 15. 5-bit Multiply

b. Standard Add-Shift Multiplier

The simplest method for doing the multiply is the
standard add-shift method. Figure 16 [Ref. 4:p. 84]
illustrates one implementation of this method using standard
integrated circuits (IC). This 8 x 8 multiplier has two 8-bit
inputs and one 16-bit output. The multiplier is initially
loaded into the shift register to provide one bit into the AND
gates with all of the multiplicand to generate the first row
of partial products. This row is then added with the sum in
the output register (initially zero). This is called the
accumulation sum. An 8-bit adder is used because the partial
product addition is done from the least significant partial
product to the most significant partial product. The output
of the adder is then put into the 9 most significant bits of
the output registers. For the next iteration, the multiplier
is shifted one bit to the right and the next significant bit is
input to the AND gates with the multiplicand. The result from
the previous adder iteration is shifted by hard wiring the
accumulation sum to shift one bit to the right every
iteration. This shift is necessary to line up the
accumulation sum with the appropriate bit positions in the
partial product. To complete the multiply, 8 iterations (8
PROD-CLK cycles) must be done. For an N x N bit multiply
there must be N iterations. This is much too slow for high
speed multiplication.
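The iteration described above can be modeled in a few lines. This is a behavioral sketch, not a description of the circuit in Figure 16: the accumulator stands in for the output register, and the names are assumptions.

```python
def add_shift_multiply(multiplicand, multiplier, n=8):
    """Multiply two n-bit unsigned integers with n add-shift iterations."""
    acc = 0                                   # output register, initially zero
    for _ in range(n):                        # one iteration per multiplier bit
        bit = multiplier & 1                  # bit presented to the AND gates
        partial = multiplicand if bit else 0  # one row of partial products
        acc += partial << n                   # add into the upper part
        acc >>= 1                             # one-bit right shift each iteration
        multiplier >>= 1                      # expose the next multiplier bit
    return acc

product_small = add_shift_multiply(13, 11)
product_large = add_shift_multiply(255, 255)
```

The loop count equals the word length, which is the reason the text judges this scheme too slow for high speed multiplication.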



Figure 16. 8 x 8 Multiplier 


c. Multiple-Shift Multiplier 

The slowness of the add-shift method can be
alleviated by using more than one multiplier bit per cycle.
To accomplish this requires multiple bit scanning and multiple
shifts after each addition. For example, the total number of
add-shift cycles can be reduced by half if two multiplier
bits are examined at a time. The hardware required is greater
than for the one bit scanning method.

When scanning two bits at a time there are four
possible actions instead of just adding the multiplicand or
adding zero. In the one bit case, this decision was made with
AND gates. Table I shows
the four situations with the correct values to be added to the
partial product. The A represents the multiplicand. To
generate 2A, the multiplicand A is asynchronously shifted one
bit position to the left. The decision to add 2A and/or A is
also done with AND gates. Figure 17 [Ref. 7:p. 142]
illustrates the configuration for two bit scanning. Since
there will be 3 operands per add (2A, A, and the previous
partial product), a carry-save adder is utilized. The carry
propagate adder (carry-ripple) in Figure 17 can be replaced
with a faster two operand adder to increase cycle frequency.
This multiplier is much the same as the standard
add-shift multiplier except that it requires shifting of more
than one bit and a multioperand adder. The number of cycles per
multiply is greatly reduced with a small increase in cycle
period. As the scan width increases, the required clock
cycles decrease, but the hardware complexity and the cycle
period will increase as scan width increases.
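Two bit scanning can be sketched as below, assuming the four cases are adding 0, A, 2A, or 2A + A according to the pair of multiplier bits. The function name and the dictionary lookup are illustrative; in hardware the selection is made with AND gates and the three operands go to a carry-save adder.

```python
def two_bit_scan_multiply(a, multiplier, n=8):
    """Multiply by examining two multiplier bits per cycle."""
    multiples = {0b00: 0, 0b01: a, 0b10: 2 * a, 0b11: 2 * a + a}
    product = 0
    for step in range((n + 1) // 2):          # half as many cycles as bits
        pair = (multiplier >> (2 * step)) & 0b11
        product += multiples[pair] << (2 * step)
    return product

product_small = two_bit_scan_multiply(13, 11)
product_large = two_bit_scan_multiply(255, 255)
```

An 8-bit multiplier is consumed in 4 cycles here instead of 8, at the cost of a wider choice of values to add in each cycle.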
d. Multiple Shift Multiplier with overlapped scanning 
In the nonoverlapped bit scanning method each 
multiplier bit generates one multiple of the multiplicand to 
be added to the partial product [Ref. 7:p. 143}. When the 
scan width gets large then the number of multiplicand 
multiples to be added gets large which decreases cycle 
frequency. The overlapped scanning method attempts to reduce 


the number of multiples to be added there fore reducing adder 


NS 


Table I. Multiplicand Multiples to be added to the 
Partial Product after Scanning 2 Multiplier Bits 


— 


Multiplier Multiplier Multiples 
Sie. al Bit 0 to be Added 


-_ 





complexity and cycle period. The number of multiplicand 


multiples can be reduced by half using the overlapped scanning 


method over the standard multiple-shift method. 


The basis of this method is that execution time
(cycle period) can be reduced by shifting across a string of zeros
in the multiplier [Ref. 7:p. 143]. The following describes a
string of k consecutive 1's in the multiplier:

Column position: ..., i+k, i+k-1, i+k-2, ..., i+1, i, i-1, ...
Bit content:     ...,  0,    1,     1,   ...,  1,  1,  0,  ...

By the string property:

$$2^{i+k} - 2^{i} = 2^{i+k-1} + 2^{i+k-2} + \cdots + 2^{i+1} + 2^{i},$$

the k consecutive ones can be replaced by the following
string:

Column position: ..., i+k, i+k-1, i+k-2, ..., i+1, i, i-1, ...
Bit content:     ...,  1,    0,     0,   ...,  0, $\bar{1}$, 0, ...

Figure 17. Adder Unit for Two Bit Scanning
Multiplier [Ref. 7:p. 143]
(labels: Old Partial Product, multiples 2A and A, Multiplier Bits,
(n + 1)-Bit Carry-Save Adder, (n + 2)-Bit Carry Propagate Adder,
New Partial Product)

The string property states that the string of k ones can be
replaced by a 1 in the next more significant bit position
above the string and a subtracted 1 in the least significant
bit position of the string. The overbar on the 1 signifies this
subtraction. This essentially replaces k consecutive adds with
one subtract at the beginning and one add at the end of the
string [Ref. 7:p. 143]. For long strings of ones this is a
considerable saving.

Implementing this method with a 2-bit scan width plus the overlap
bit requires that both addition and subtraction be possible
during each cycle.
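
The string property can be checked numerically. The following is a minimal Python sketch, not the design from Reference 7: it assumes an unsigned multiplier and scans one bit at a time with one overlap bit, producing a signed digit for each bit position. The function name `recode_strings_of_ones` is a hypothetical name chosen for illustration.

```python
def recode_strings_of_ones(multiplier: int, width: int) -> list[int]:
    """Recode an unsigned multiplier into signed digits (LSB first).

    Each digit is in {-1, 0, +1}.  A string of ones starting at bit i and
    ending at bit i+k-1 becomes -1 at position i and +1 at position i+k,
    which is the string property 2**(i+k) - 2**i.
    """
    digits = []
    previous_bit = 0  # overlap bit to the right of bit 0
    for position in range(width + 1):  # one extra position for a final +1
        bit = (multiplier >> position) & 1 if position < width else 0
        digits.append(previous_bit - bit)
        previous_bit = bit
    return digits


def digits_value(digits: list[int]) -> int:
    """Evaluate a signed-digit list (LSB first) back into an integer."""
    return sum(digit * (1 << position) for position, digit in enumerate(digits))


# A string of k = 4 ones starting at bit i = 2: 0b0111100 = 2**6 - 2**2.
example = recode_strings_of_ones(0b0111100, 7)
nonzero = {pos: d for pos, d in enumerate(example) if d != 0}
print(nonzero)
```

For the example multiplier only two nonzero digits remain, so four adds collapse into one subtract and one add.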
e. Booth's Multiplier 
The Booth multiplier is a recoding algorithm that 
is also based on the string property. This method is similar 
to the overlapped scanning algorithm except it is used for 
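two's complement multiplication. Let $B = B_4 B_3 B_2 B_1 B_0$, so the
value of B is:

$$B_{10} = B_4 \times (-16) + B_3 \times 8 + B_2 \times 4 + B_1 \times 2 + B_0 \times 1.$$

The above equation can be manipulated into:

$$B_{10} = -16 \times (B_4 - B_3) - 8 \times (B_3 - B_2) - 4 \times (B_2 - B_1) - 2 \times (B_1 - B_0) - 1 \times (B_0 - 0).$$
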
The values in the parentheses in the last equation can have
the values 1, 0, or -1. The shift algorithm for this
multiplication is exactly the same as the 2 bit multiple-shift 
multiplier. The difference is that the possible actions to 
take are add, subtract, or do nothing as opposed to add or do 
nothing in the multiple-shift multiplier. [Ref. 4:pp. 90-91] 
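
As an illustration of the recoding above, the following Python sketch multiplies two small two's complement integers by taking an add, subtract, or do-nothing action per multiplier bit. It is a behavioral model under assumed names (`booth_digits`, `booth_multiply`) and a default 5-bit multiplier width, not a description of any particular hardware.

```python
def booth_digits(pattern: int, width: int) -> list[int]:
    """Return Booth digits d_i = B_(i-1) - B_i, LSB first, with B_(-1) = 0.

    `pattern` is the raw width-bit two's complement bit pattern.
    Each digit is +1 (add), -1 (subtract), or 0 (do nothing).
    """
    digits = []
    previous_bit = 0
    for position in range(width):
        bit = (pattern >> position) & 1
        digits.append(previous_bit - bit)
        previous_bit = bit
    return digits


def booth_multiply(a: int, b: int, width: int = 5) -> int:
    """Multiply a by b, where b fits in `width` bits of two's complement."""
    pattern = b & ((1 << width) - 1)  # raw bit pattern of the multiplier
    product = 0
    for position, digit in enumerate(booth_digits(pattern, width)):
        if digit == 1:
            product += a << position   # add the shifted multiplicand
        elif digit == -1:
            product -= a << position   # subtract the shifted multiplicand
    return product


print(booth_multiply(7, -3))
```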
f. Summary 

There are many multiplier designs that would fit in 
the "shift and add" standard category that have not been 
discussed in this section. All of these designs can be 
optimized to increase speed and decrease the number of cycles. 
But if they are characterized as standard then they will be
recursive. This implies multicycle completion, and the
multiplier is not easily pipelined for high speed operations.
These multipliers cannot or at least should not be used for 
high speed digital arithmetic. The next section discusses 
multiplier designs that are more appropriate for high speed 
arithmetic. 
2. Cellular Array Multipliers 
a. Standard Parallel Multiplier 

(1) Introduction. The parallel multiplier is based
on the observation that partial products in the multiplier can
be computed in parallel [Ref. 8:p. 344]. The standard
add-shift multiplier generates its partial products with
AND gates one row per clock cycle. For
an N x N multiply there are $N^2$ partial products. To generate
all of them at once there must be $N^2$ AND gates. To sum all of
the partial products the multiplier requires N(N-2) full adder
cells and N half adder cells. Figure 15 illustrates the partial
products to be summed together for a 5 x 5 multiply. Since
the partial products are generated in parallel, the primary
delay in the computation is due to adding the partial products
to get the final product.

(2) Parallel Multiplier Cell. Figure 18 [Ref. 8:p.
345] is an illustration of the parallel multiplier cell. It
consists of an AND gate and a full adder. This cell is the
only part that is required to build a parallel multiplier.






Figure 18. Parallel Multiplier Cell 


(3) Parallel Multiplier. The parallel multiplier
is an array of parallel multiplier cells arranged to output
the unsigned product of two unsigned numbers. Figure 19(a)
[Ref. 8:p. 345] is the multiplier with the partial products on
each cell. Figure 19(b) has the same arrangement as in Figure
19(a) but in a square array. The latter arrangement is more
convenient to implement in VLSI hardware. It also lends
itself to pipelining the multiplier into stages to increase
clock frequency.

Figure 19. 4 x 4 Parallel Multiplier


b. Wallace Tree 

The Wallace tree is a solution for reducing the
delays due to summing the partial products in a multiplier.
It takes inputs of the same significance and outputs the sum
of these inputs. For example, the full adder cell is a 3
input, 2 output Wallace tree. Any size Wallace tree can be
built from the 3 to 2 Wallace tree. Figure 20 [Ref. 7:p. 166]
illustrates the full adder cell as a 3 to 2 Wallace tree and
a 7 to 3 Wallace tree built from full adder cells. The
Wallace tree is nothing more than a multioperand bit-slice
adder.

Figure 20. Wallace Trees (3 bit-slice inputs; 7 bit-slice inputs)
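
The construction of a 7 to 3 Wallace tree from full adder cells can be sketched in Python as follows. The wiring shown is one common arrangement of four full adders and is an assumption; the exact wiring of Figure 20 is not reproduced here. The function names are hypothetical.

```python
def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """3 to 2 Wallace tree: return (sum bit, carry bit) of three equal-weight bits."""
    total = a + b + c
    return total & 1, total >> 1


def wallace_7_to_3(bits: list[int]) -> tuple[int, int, int]:
    """Count the ones among 7 equal-weight bits using four full adders.

    Returns (weight-1 bit, weight-2 bit, weight-4 bit).
    """
    x0, x1, x2, x3, x4, x5, x6 = bits
    s1, c1 = full_adder(x0, x1, x2)      # first level
    s2, c2 = full_adder(x3, x4, x5)      # first level
    s3, c3 = full_adder(s1, s2, x6)      # weight-1 output
    s4, c4 = full_adder(c1, c2, c3)      # weight-2 and weight-4 outputs
    return s3, s4, c4


low, mid, high = wallace_7_to_3([1, 0, 1, 1, 0, 1, 1])
print(low + 2 * mid + 4 * high)  # number of ones among the inputs
```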


c. Summary 
There are many other array multipliers in use but
their goal is much the same: reduce the partial product adding
delay. Regardless of method, the tradeoff is clear: if
the design must be fast then the hardware complexity must be
high. More hardware translates to higher cost. Although the
cellular array multipliers are faster, they are also much more
expensive than the serial add-shift multipliers.


D. FLOATING POINT ARITHMETIC 
1. Introduction 
Using a floating point number system makes arithmetic
operations much more complex. For the most part the
discussion is for operations with normalized numbers, although
design differences for denormalized number systems are
discussed. Multiplication is addressed first because it is
much simpler than addition.

Other assumptions to be made about the floating point
number system are: 1) Mantissa is in signed magnitude;
2) Exponent is in excess code; 3) The floating point number
system is in base 2.

2. Floating Point Multiplication 
The product of two floating point numbers A and B
looks like:

$$A \times B = M_A \times 2^{E_A} \times M_B \times 2^{E_B} = (M_A \times M_B) \times 2^{E_A + E_B}.$$

The product of two floating point numbers is represented by
the integer product of the mantissas times 2 raised to the sum
of the exponents. The output sign bit is just the
exclusive OR of the input sign bits. Figure 21 illustrates
the operations indicated in the previous equations.

The Exponent Add block is not just a simple integer
adder. Since the exponents are in excess code, some
additional random logic must be used. This module must also
indicate if there is an overflow or an underflow generated by
the add. The Mantissa Multiply block is an integer multiplier
module. This block incurs the most delay and uses the most
logic of the floating point multiplier, so this block must be
optimally designed for speed and hardware. No special
circuitry is required because the mantissa in this system is
represented in signed magnitude. The Normalization block can
be broken up into 3 sub-blocks: Normalizer, Rounder, and
Postnormalizer.

Figure 21. Block Diagram of a Floating Point
Multiplier (blocks: Exponent Add, Multiplier, Normalization,
Exponent Adjust; outputs: Result Exponent, Result Mantissa)

The Normalizer sub-block detects if the product from
the Mantissa Multiply block is a normalized number and
normalizes it if it is not. Since the multiplier only
computes products of normalized numbers the product will be
between 1 and 4. This is illustrated by a 4 bit mantissa
multiply with a "hidden" bit of the minimum and maximum values
that can be represented:

Minimum: 1.0000 x 1.0000 = 1.00000000;

Maximum: 1.1111 x 1.1111 = 11.11000001.

As the mantissa length increases, the maximum product gets
closer to 4. This makes the decision to normalize and the
actual normalization very simple. If there is a 1 in the bit
position immediately above the "hidden" bit then the mantissa
must be normalized. To be normalized, the mantissa merely
needs to be shifted to the right one bit and the exponent
incremented by one. If the multiplier were to allow
denormalized numbers, then the product would not necessarily
be between 1 and 4. To normalize such a number will require
a significantly larger amount of hardware to detect and shift
the most significant 1 from anywhere in the product mantissa.

The Rounder sub-block rounds the product from the
Normalizer to the correct number of significant bits for the
number system. There are many methods for rounding:
truncation, rounding, unbiased rounding, and jamming to name
a few. Regardless of the method that is used, when a decision
is made to round up, a 1 is added to the least significant bit
of the mantissa. This add could cause a carry out, which
requires another normalization step performed by the
Postnormalizer sub-block.

The Postnormalizer sub-block provides normalization of
the sum generated by the Rounder sub-block. Its function is
exactly the same as that of the Normalizer sub-block.

The Exponent Adjust block is merely an adder that
increments the exponent as directed by the Normalizer and
Postnormalizer sub-blocks. Implementing the multiplier with
denormalized numbers will require that the adder be capable of
adjusting the exponent by more than 1 for the Normalizer
because the mantissa product may be less than 1.
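
The flow through the Exponent Add, Mantissa Multiply, Normalizer, Rounder, Postnormalizer, and Exponent Adjust blocks can be sketched in Python. This is a simplified behavioral model under stated assumptions: exponents are plain (unbiased) integers rather than excess code, mantissas are normalized integers that include the hidden bit, rounding is round-half-up, and overflow and underflow are ignored. The function name `fp_multiply` and the default of 4 fraction bits are illustrative choices.

```python
def fp_multiply(sign_a, exp_a, man_a, sign_b, exp_b, man_b, frac_bits=4):
    """Multiply two normalized floating point numbers.

    Mantissas are integers holding 1.f with `frac_bits` fraction bits,
    so 1.0 is 1 << frac_bits.  Returns (sign, exponent, mantissa).
    """
    sign = sign_a ^ sign_b               # exclusive OR of the sign bits
    exponent = exp_a + exp_b             # Exponent Add
    product = man_a * man_b              # Mantissa Multiply, value in [1, 4)

    # Normalizer: a 1 above the hidden bit means the product is >= 2.
    drop = frac_bits                     # low-order bits to round away
    if product >> (2 * frac_bits + 1):
        drop += 1                        # shift right one more bit
        exponent += 1                    # Exponent Adjust

    # Rounder: add half of the last kept bit, then discard the low bits.
    mantissa = (product + (1 << (drop - 1))) >> drop

    # Postnormalizer: rounding may carry out of the mantissa.
    if mantissa >> (frac_bits + 1):
        mantissa >>= 1
        exponent += 1
    return sign, exponent, mantissa


# 1.1111 x 1.1111 (binary) = 11.11000001, which normalizes to 1.1110 x 2^1.
print(fp_multiply(0, 0, 0b11111, 0, 0, 0b11111))
```

The third test case in a quick check, 1.0101 x 1.1000, lands just under 2 and exercises the Postnormalizer after rounding up.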

3. Floating Point Addition 

The primary reason that floating point addition is
more difficult is that the mantissas usually have different
significance. Therefore, before doing any arithmetic, they
must be aligned to perform the mantissa addition. Alignment
means that the exponents must be made equal to correctly add
the mantissas. The sum of two positive floating point numbers is:

$$A + B = M_A \times 2^{E_A} + M_B \times 2^{E_B}.$$

Assuming that $E_A < E_B$, then $E_A - E_B$ is negative. The alignment
of the two inputs is accomplished by shifting the mantissa of
the smaller input the correct number of positions to the
right. The number of positions to shift is determined by the
difference of the two exponents ($E_A$ and $E_B$). The sum can now
be represented by:

$$A + B = (M_A \times 2^{E_A - E_B} + M_B) \times 2^{E_B},$$

where the two values in parentheses are of the same
significance. Figure 22 illustrates the general operations in
a floating point adder.

The first block, the Zero Test block, determines
if either of the two inputs is true zero. This is required
only for representing the "hidden" bit of the input mantissas.
If an operand is true zero then the "hidden" bit will be
represented by zero. In this way the adder does not need any
significant special handling logic for zero operands.

Before alignment can be completed, the greater
exponent must be determined. This is done in the Exponent
Compare block. This block provides the selection information
to the Mantissa Select block and selects the output sum's
correct exponent. The simplest way to do this is to subtract one
input exponent ($E_B$) from the other ($E_A$) and test the output for
a positive number. If it is positive then A has the
largest exponent and $E_A$ is the sum's exponent. If it is
negative then B has the largest exponent and $E_B$ is the sum's
exponent. The outputs from this block are the sum's exponent,
which has not been adjusted yet, and a selection bit to select
which mantissa shall be aligned and which shall not. The
Mantissa Select block merely selects which mantissa will be
shifted and completes the shift using the selection bit that
is determined in the Exponent Compare block.

Figure 22. Block Diagram of a Floating Point Adder

The Adder block completes the addition of the
unshifted and shifted mantissas. The Normalization and
Exponent Adjust blocks are similar to the Normalization and
Exponent Adjust blocks of the multiplier allowing denormalized
numbers.
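
The alignment step can be sketched in Python for two positive, normalized operands. This is a simplified model under stated assumptions: exponents are plain integers, bits shifted out during alignment are truncated, no rounding is performed, and only the carry-out case of normalization is handled (adding two positive normalized numbers never needs a left shift). The function name `fp_add_positive` is hypothetical.

```python
def fp_add_positive(exp_a, man_a, exp_b, man_b, frac_bits=4):
    """Add two positive normalized floating point numbers.

    Mantissas are integers holding 1.f with `frac_bits` fraction bits.
    Returns (exponent, mantissa) of the sum.
    """
    # Exponent Compare: the sign of the difference selects the larger exponent.
    difference = exp_a - exp_b
    if difference >= 0:
        exponent, kept, shifted, shift = exp_a, man_a, man_b, difference
    else:
        exponent, kept, shifted, shift = exp_b, man_b, man_a, -difference

    # Mantissa Select and alignment: shift the smaller operand right.
    aligned = shifted >> shift           # shifted-out bits are truncated

    # Adder: both mantissas now have the same significance.
    total = kept + aligned

    # Normalization and Exponent Adjust: the sum may reach 2 or more.
    if total >> (frac_bits + 1):
        total >>= 1
        exponent += 1
    return exponent, total


# 1.1000 x 2^1 (3.0) plus 1.0000 x 2^0 (1.0) gives 1.0000 x 2^2 (4.0).
print(fp_add_positive(1, 0b11000, 0, 0b10000))
```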


E. SUMMARY 

High speed arithmetic is useful for any application where
fast computation is required. In signal processing
applications, high speed arithmetic processing is a
requirement. The fundamental building block of any spectrum
analyzer is the FFT butterfly, which is made from multipliers
and adders. The cyclic spectrum analyzer requires large FFTs
to be computed, which implies many multiplies and adds in the
computation. To compute the cyclic plane in near real time,
the multipliers and adders must be extremely fast. Chapter 3
describes the design of FFTs and cyclic spectrum analyzers in
terms of number of FFT butterflies, multipliers and adders.


III. CYCLIC SPECTRUM ANALYZER


A. INTRODUCTION 

The objective of this chapter is to consider ways to
implement cyclic spectrum algorithms in near real time. The
value used to characterize the closeness to real time is
called the Real Time Factor ($F_T$):

$$F_T = \frac{\text{COMPUTATION TIME}}{\text{COLLECT TIME}}.$$

The number of hardware units ($p_h$) needed to operate at a given
factor of real time is:

$$p_h = \frac{C_h}{F_T \, \Delta t} = \frac{C_h}{F_T \, N},$$

where $\Delta t = N$ is the total number of samples processed and $C_h$
is the number of operations performed by the hardware unit.
The complexity product ($p_h \cdot F_T$) is defined as a measure of
the hardware complexity of the implemented algorithms (FFT
butterflies and cyclic spectrum analyzers) to be discussed.
[Ref. 1]


B. FFT DESIGN 
1. Introduction 
The fundamental building block for any spectrum
analyzer is the FFT butterfly. There are two versions of
the FFT algorithm: Decimation in Time (DIT) and
Decimation in Frequency (DIF). The following discussion
addresses both versions of the radix-2 and radix-4
butterflies. N-point FFT designs using the radix-2 and
radix-4 butterflies are discussed and compared.
2. Radix-2 FFT Butterfly

The radix-2 FFT butterfly is simply a method to
compute the Discrete Fourier Transform (DFT) of a two point
sequence. This DFT can be expressed as:

$$X(k) = \sum_{n=0}^{1} x(n) e^{-j 2\pi n k / 2}, \quad k = 0, 1.$$

Letting $W_2 = e^{-j 2\pi / 2}$, $(W_2)^{nk}$ is called the weighting or twiddle
factor. X(k) is then equal to:

$$X(k) = x(0)(W_2)^{0} + x(1)(W_2)^{k}.$$

Substituting the value for $W_2$, X(0) and X(1) are:

$$X(0) = x(0) + x(1);$$

$$X(1) = x(0) - x(1).$$

The signal flow graph for this algorithm is shown in Figure 23
and is called the radix-2 FFT butterfly. This algorithm can
be implemented with an inversion and 2 adds.
3. Radix-4 FFT Butterfly 
The radix-4 FFT butterfly is a method to compute the
DFT of a 4 point sequence. This DFT is expressed as:

$$X(k) = \sum_{n=0}^{3} x(n)(W_4)^{nk} = x(0)(W_4)^{0} + x(1)(W_4)^{k} + x(2)(W_4)^{2k} + x(3)(W_4)^{3k},$$

where k = 0, 1, 2, 3 and $W_4 = e^{-j 2\pi / 4} = -j$;
then X(k) is:

$$X(k) = x(0) + x(1)(-j)^{k} + x(2)(-1)^{k} + x(3)(j)^{k}.$$

The signal flow graph for this algorithm is shown in Figure 24
and is called the radix-4 FFT butterfly.
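
The two butterflies can be checked against the direct DFT definition with a short Python sketch. The function names are illustrative. The radix-4 version groups the terms so that it uses eight complex additions and only a trivial multiplication by -j; this grouping is one possible arrangement and is not taken from Figure 24.

```python
import cmath


def dft(x):
    """Direct DFT from the definition, used here only as a reference."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]


def radix2_butterfly(x0, x1):
    """2 point DFT: one sum and one difference."""
    return x0 + x1, x0 - x1


def radix4_butterfly(x0, x1, x2, x3):
    """4 point DFT: X(k) = x0 + x1(-j)^k + x2(-1)^k + x3(j)^k."""
    a, b = x0 + x2, x0 - x2
    c, d = x1 + x3, -1j * (x1 - x3)
    return a + c, b + d, a - c, b - d


samples = [1 + 2j, -0.5j, 3, 2 - 1j]
print(radix4_butterfly(*samples))
print(dft(samples))
```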
4. N Point FFTs 
a. Introduction 
Obviously, sequences that are to be transformed are
much longer than 2 or 4 points. They could be as long as a
million or more points. The N point DFT is given as:

$$X(k) = \sum_{n=0}^{N-1} x(n) e^{-j 2\pi n k / N}; \quad k = 0, 1, 2, \ldots, N-1.$$

This DFT can be realized with either the radix-2 or radix-4
FFT butterflies.
b. Radix-2 FFT 
(1) Introduction. An $N = 2^{\gamma}$ point FFT can be
constructed from radix-2 FFT butterflies. Let n and k be
represented in binary form:

$$n = 2^{\gamma-1} n_{\gamma-1} + 2^{\gamma-2} n_{\gamma-2} + \cdots + n_0;$$

$$k = 2^{\gamma-1} k_{\gamma-1} + 2^{\gamma-2} k_{\gamma-2} + \cdots + k_0;$$

then X(k) is:

$$X(k_{\gamma-1}, \ldots, k_0) = \sum_{n_0=0}^{1} \sum_{n_1=0}^{1} \cdots \sum_{n_{\gamma-1}=0}^{1} x_0(n_{\gamma-1}, \ldots, n_0) \, W^{p},$$

where $W = e^{-j 2\pi / N}$ and p is:

$$p = nk = (2^{\gamma-1} k_{\gamma-1} + \cdots + k_0)(2^{\gamma-1} n_{\gamma-1} + \cdots + n_0).$$

Figure 23. Radix-2 FFT Butterfly


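(2) DIT Algorithm. From the previous equation $W^p$
can be rewritten as:

$$W^p = W^{(2^{\gamma-1}k_{\gamma-1} + \cdots + k_0)(2^{\gamma-1}n_{\gamma-1})} \times W^{(2^{\gamma-1}k_{\gamma-1} + \cdots + k_0)(2^{\gamma-2}n_{\gamma-2})} \times \cdots \times W^{(2^{\gamma-1}k_{\gamma-1} + \cdots + k_0)n_0}.$$

The first term can be simplified into:

$$W^{(2^{\gamma-1}k_{\gamma-1} + \cdots + k_0)(2^{\gamma-1}n_{\gamma-1})} = W^{2^{\gamma-1}(k_0 n_{\gamma-1})},$$

and the second into:

$$W^{(2^{\gamma-1}k_{\gamma-1} + \cdots + k_0)(2^{\gamma-2}n_{\gamma-2})} = W^{(2k_1 + k_0)(2^{\gamma-2}n_{\gamma-2})},$$

because:

$$W^{2^{\gamma}} = W^{N} = e^{-j 2\pi} = 1.$$

The rest of the terms are simplified in a similar manner.
X(k) can now be rewritten as:

$$X(k_{\gamma-1}, \ldots, k_0) = \sum_{n_0=0}^{1} \sum_{n_1=0}^{1} \cdots \sum_{n_{\gamma-1}=0}^{1} x_0(n_{\gamma-1}, \ldots, n_0) \, W^{2^{\gamma-1}(k_0 n_{\gamma-1})} \times W^{(2k_1 + k_0)(2^{\gamma-2}n_{\gamma-2})} \times \cdots \times W^{(2^{\gamma-1}k_{\gamma-1} + \cdots + k_0)n_0},$$

where the W's are the twiddle factors for each stage. The FFT
can be written recursively as a function of the previous
stages:

$$x_1(k_0, n_{\gamma-2}, \ldots, n_0) = \sum_{n_{\gamma-1}=0}^{1} x_0(n_{\gamma-1}, \ldots, n_0) \, W^{2^{\gamma-1}(k_0 n_{\gamma-1})},$$

$$x_2(k_0, k_1, n_{\gamma-3}, \ldots, n_0) = \sum_{n_{\gamma-2}=0}^{1} x_1(k_0, n_{\gamma-2}, \ldots, n_0) \, W^{(2k_1 + k_0)(2^{\gamma-2}n_{\gamma-2})},$$

and finally:

$$x_{\gamma}(k_0, k_1, \ldots, k_{\gamma-1}) = \sum_{n_0=0}^{1} x_{\gamma-1}(k_0, \ldots, k_{\gamma-2}, n_0) \, W^{(2^{\gamma-1}k_{\gamma-1} + \cdots + k_0)n_0},$$

$$X(k_{\gamma-1}, \ldots, k_0) = x_{\gamma}(k_0, \ldots, k_{\gamma-1}),$$

where $x_0(n)$ is the input sequence. The output of the final
stage will always be in bit reversed order, hence the bit
reversal in the last equation. [Ref. 9:pp. 176-178]

Figure 24. Radix-4 FFT Butterfly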


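(3) DIF Algorithm. $W^p$ can be rewritten as:

$$W^p = W^{(2^{\gamma-1}n_{\gamma-1} + \cdots + n_0)k_0} \times W^{(2^{\gamma-1}n_{\gamma-1} + \cdots + n_0)(2k_1)} \times \cdots \times W^{(2^{\gamma-1}n_{\gamma-1} + \cdots + n_0)(2^{\gamma-1}k_{\gamma-1})}.$$

Simplification similar to that done in the DIT algorithm leads
to the following recursive equations that describe the FFT as
a function of the previous stages:

$$x_1(k_0, n_{\gamma-2}, \ldots, n_0) = \sum_{n_{\gamma-1}=0}^{1} x_0(n_{\gamma-1}, \ldots, n_0) \, W^{(2^{\gamma-1}n_{\gamma-1} + \cdots + n_0)k_0},$$

$$x_2(k_0, k_1, n_{\gamma-3}, \ldots, n_0) = \sum_{n_{\gamma-2}=0}^{1} x_1(k_0, n_{\gamma-2}, \ldots, n_0) \, W^{(2^{\gamma-2}n_{\gamma-2} + \cdots + n_0)(2k_1)},$$

and finally:

$$x_{\gamma}(k_0, k_1, \ldots, k_{\gamma-1}) = \sum_{n_0=0}^{1} x_{\gamma-1}(k_0, \ldots, k_{\gamma-2}, n_0) \, W^{2^{\gamma-1} n_0 k_{\gamma-1}},$$

$$X(k_{\gamma-1}, \ldots, k_0) = x_{\gamma}(k_0, \ldots, k_{\gamma-1}).$$

[Ref. 9]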
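
The stage-by-stage recursion and the bit reversed output order can be illustrated with a short in-place radix-2 FFT in Python. This is a minimal sketch of a common decimation-in-frequency formulation, written with assumed function names; it is not a transcription of the equations in Reference 9, although each pass of the outer loop plays the role of one stage.

```python
import cmath


def bit_reverse(index: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def fft_radix2_dif(x):
    """Radix-2 decimation-in-frequency FFT for len(x) a power of two."""
    n = len(x)
    gamma = n.bit_length() - 1           # number of stages, N = 2**gamma
    if n == 0 or (1 << gamma) != n:
        raise ValueError("length must be a power of two")
    data = list(x)
    span = n // 2
    while span >= 1:                     # one pass per stage
        for start in range(0, n, 2 * span):
            for j in range(span):
                twiddle = cmath.exp(-2j * cmath.pi * j / (2 * span))
                top, bottom = data[start + j], data[start + j + span]
                data[start + j] = top + bottom              # butterfly sum
                data[start + j + span] = (top - bottom) * twiddle
        span //= 2
    # The final stage leaves the results in bit reversed order.
    return [data[bit_reverse(k, gamma)] for k in range(n)]


def direct_dft(x):
    """Reference DFT from the definition."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]


signal = [1, 2 - 1j, 0, -1 + 2j, 3, 0.5j, -2, 1]
print(max(abs(a - b) for a, b in zip(fft_radix2_dif(signal), direct_dft(signal))))
```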
(4) Complexity of FFT using the Radix-2 Butterfly. 
The complexity for an N point, radix-2 FFT can be computed in
number of butterflies. Each stage will require N/2 two point
FFTs (butterflies). The total number of stages is $\log_2 N$.
The total number of radix-2 butterflies required to implement
an N point FFT is:

$$C_{b2} = \frac{N}{2} \log_2 N.$$

If the butterfly is implemented in a complex number system
then the number of complex multiplies ($C_{cm}$) and additions ($C_{ca}$)
are:

$$C_{cm} = \frac{N}{2} \log_2 N;$$

$$C_{ca} = N \log_2 N.$$

Then the number of real multiplies ($C_{rm}$) and additions ($C_{ra}$)
is:

$$C_{rm} = 2N \log_2 N;$$

$$C_{ra} = 3N \log_2 N.$$

If each multiply and add were implemented in hardware then the
FFT would be a rate-1 operator, i.e., in each clock cycle, 2
complex x values and 1 W value would be input to the butterfly
and the complex butterfly would produce 2 complex output
values. However, it would be uneconomical and unwise to
implement the FFT as a rate-1 operator with rate-1
butterflies. The radix-2 butterfly can be designed as a
rate-1/2 operator, which gives a complexity product for an N
point FFT of:

$$p_h F_T = \frac{2 C_h}{N}.$$

Substituting $C_{b2}$ for $C_h$ then:

$$p_{b2} F_T = \log_2 N.$$

This shows that to achieve real time ($F_T = 1$) then $p_{b2} = \log_2 N$
hardware units (complex 1/2 rate radix-2 FFT butterflies) are
required. [Ref. 1] Reference 1 shows a structure of a radix-2
FFT constructed using rate-1/2 complex butterfly units.



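c. Radix-4 FFT
(1) DIT Algorithm. An $N = 4^{\beta}$ point FFT can be
constructed from radix-4 FFT butterflies. Let n and k be
represented in quaternary form:

$$n = 4^{\beta-1} n_{\beta-1} + 4^{\beta-2} n_{\beta-2} + \cdots + n_0;$$

$$k = 4^{\beta-1} k_{\beta-1} + 4^{\beta-2} k_{\beta-2} + \cdots + k_0;$$

then X(k) is:

$$X(k_{\beta-1}, \ldots, k_0) = \sum_{n_0=0}^{3} \sum_{n_1=0}^{3} \cdots \sum_{n_{\beta-1}=0}^{3} x_0(n_{\beta-1}, \ldots, n_0) \, W^{p}.$$

If p = nk, then $W^p$ is:

$$W^p = W^{(4^{\beta-1}k_{\beta-1} + \cdots + k_0)(4^{\beta-1}n_{\beta-1})} \times W^{(4^{\beta-1}k_{\beta-1} + \cdots + k_0)(4^{\beta-2}n_{\beta-2})} \times \cdots \times W^{(4^{\beta-1}k_{\beta-1} + \cdots + k_0)n_0}.$$

The first term can be simplified into:

$$W^{(4^{\beta-1}k_{\beta-1} + \cdots + k_0)(4^{\beta-1}n_{\beta-1})} = W^{4^{\beta-1}(k_0 n_{\beta-1})},$$

and the second into:

$$W^{(4^{\beta-1}k_{\beta-1} + \cdots + k_0)(4^{\beta-2}n_{\beta-2})} = W^{(4k_1 + k_0)(4^{\beta-2}n_{\beta-2})},$$

because:

$$W^{4^{\beta}} = W^{N} = e^{-j 2\pi} = 1.$$

The rest of the terms are simplified in a similar manner. The
FFT is written recursively as a function of the previous
stages:

$$x_1(k_0, n_{\beta-2}, \ldots, n_0) = \sum_{n_{\beta-1}=0}^{3} x_0(n_{\beta-1}, \ldots, n_0) \, W^{4^{\beta-1}(k_0 n_{\beta-1})},$$

$$x_2(k_0, k_1, n_{\beta-3}, \ldots, n_0) = \sum_{n_{\beta-2}=0}^{3} x_1(k_0, n_{\beta-2}, \ldots, n_0) \, W^{(4k_1 + k_0)(4^{\beta-2}n_{\beta-2})},$$

and finally:

$$x_{\beta}(k_0, k_1, \ldots, k_{\beta-1}) = \sum_{n_0=0}^{3} x_{\beta-1}(k_0, \ldots, k_{\beta-2}, n_0) \, W^{(4^{\beta-1}k_{\beta-1} + \cdots + k_0)n_0},$$

$$X(k_{\beta-1}, \ldots, k_0) = x_{\beta}(k_0, \ldots, k_{\beta-1}).$$

The twiddle factors of each stage can be simplified as
follows:

$$W^{4^{\beta-1}(k_0 n_{\beta-1})} = (W^{N/4})^{k_0 n_{\beta-1}} = (e^{-j 2\pi / 4})^{k_0 n_{\beta-1}} = (W_4)^{k_0 n_{\beta-1}},$$

$$W^{(4k_1 + k_0)(4^{\beta-2}n_{\beta-2})} = (W^{N/16})^{(4k_1 + k_0)n_{\beta-2}} = (W_{16})^{(4k_1 + k_0)n_{\beta-2}} = (W_4)^{k_1 n_{\beta-2}} (W_{16})^{k_0 n_{\beta-2}},$$

and finally:

$$W^{(4^{\beta-1}k_{\beta-1} + \cdots + k_0)n_0} = (W_4)^{k_{\beta-1} n_0} (W_{16})^{k_{\beta-2} n_0} \times \cdots \times (W_N)^{k_0 n_0}.$$

The $W_4$ terms are the twiddle factors internal to the radix-4
butterfly. All of the other W terms are external twiddle
factors. These factors indicate that the first external
twiddle factor of the butterfly will always be 1. This means
that only three twiddle factor multipliers are required.
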


(2) DIF Algorithm. Using a similar simplification
and rearrangement of the W exponents as in the radix-2 DIF
algorithm, the recursive equations that describe the FFT as a
function of the previous stages are:

$$x_1(k_0, n_{\beta-2}, \ldots, n_0) = \sum_{n_{\beta-1}=0}^{3} x_0(n_{\beta-1}, \ldots, n_0) \, W^{(4^{\beta-1}n_{\beta-1} + \cdots + n_0)k_0},$$

$$x_2(k_0, k_1, n_{\beta-3}, \ldots, n_0) = \sum_{n_{\beta-2}=0}^{3} x_1(k_0, n_{\beta-2}, \ldots, n_0) \, W^{(4^{\beta-2}n_{\beta-2} + \cdots + n_0)(4k_1)},$$

and finally:

$$x_{\beta}(k_0, k_1, \ldots, k_{\beta-1}) = \sum_{n_0=0}^{3} x_{\beta-1}(k_0, \ldots, k_{\beta-2}, n_0) \, W^{4^{\beta-1} n_0 k_{\beta-1}},$$

$$X(k_{\beta-1}, \ldots, k_0) = x_{\beta}(k_0, \ldots, k_{\beta-1}).$$

The twiddle factors can be simplified for this algorithm also:

$$W^{(4^{\beta-1}n_{\beta-1} + \cdots + n_0)k_0} = (W_4)^{n_{\beta-1} k_0} (W_{16})^{n_{\beta-2} k_0} \times \cdots \times (W_N)^{n_0 k_0},$$

$$W^{(4^{\beta-2}n_{\beta-2} + \cdots + n_0)(4k_1)} = (W_4)^{n_{\beta-2} k_1} \times \cdots \times (W_N)^{4 k_1 n_0} = (W_4)^{n_{\beta-2} k_1} \times \cdots \times (W_{N/4})^{k_1 n_0},$$

and finally:

$$W^{4^{\beta-1} n_0 k_{\beta-1}} = (W_4)^{n_0 k_{\beta-1}}.$$

These equations also indicate only three external twiddle
factor multiplies per butterfly. The difference is that the
multiplies are after the butterfly instead of before the
butterfly.
(3) Complexity of FFT using the Radix-4 Butterfly.
The complexity for an N point, radix-4 FFT can be computed in
number of butterflies. Each stage requires N/4 4-point FFTs
(butterflies), and the total number of stages is $\log_4 N$. The
total number of radix-4 butterflies required to implement an
N point FFT is:

$$C_{b4} = \frac{N}{4} \log_4 N.$$

If the butterfly is implemented in a complex number system
then the number of complex multiplies ($C_{cm}$) and additions ($C_{ca}$)
are:

$$C_{cm} = \frac{3}{4} N \log_4 N;$$

$$C_{ca} = 2N \log_4 N.$$

Then the number of real multiplies ($C_{rm}$) and additions ($C_{ra}$)
is:

$$C_{rm} = 3N \log_4 N;$$

$$C_{ra} = \frac{11}{2} N \log_4 N.$$

If there is a multiplier and adder for each of these
operations the FFT would be a rate-1 operator. As in the case
of the FFT implemented with radix-2 butterflies, implementing
the FFT in this manner would be uneconomical. The radix-4
butterfly is best implemented as a rate-1/4 operator, which
gives a complexity product for an N point FFT of:

$$p_h F_T = \frac{4 C_h}{N}.$$

The rate-1/4 operator is shown schematically in Figure 25. It
receives 4 complex x-inputs sequentially in 4 clock cycles and
simultaneously inputs 4 multiplying complex twiddle factors
and produces the 4 next level components sequentially after a
pipeline delay of d clock cycles. Substituting $C_{b4}$ for $C_h$ then:

$$p_{b4} F_T = \log_4 N.$$

This shows that to achieve real time ($F_T = 1$) then $p_{b4} = \log_4 N$
hardware units (complex 1/4 rate radix-4 FFT butterflies) are
required. [Ref. 1]
d. Comparison 
The complexity of an FFT using the radix-2 and
radix-4 complex FFT butterflies is given in paras. B.4.b.4 and
B.4.c.3, respectively. Although the complexity of the radix-4
butterfly is much greater than the radix-2 butterfly, an FFT
built with radix-4 butterflies has far fewer butterfly units
than the FFT built with radix-2 butterflies ($(N/4)\log_4 N$ as
opposed to $(N/2)\log_2 N$). If the complex radix-4 butterfly can
be built on 1 chip then a large FFT should be built with these
units.

Figure 25. Rate-1/4 Complex Radix-4 FFT Butterfly
(inputs x(i) and W(i) for each butterfly, i = 1,2,3,4)
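
The comparison can be made concrete with a small Python sketch that evaluates the butterfly counts and the number of hardware units needed for real time operation ($F_T = 1$) from the expressions above. The function names are hypothetical, and the sketch assumes N is an exact power of the radix.

```python
def stage_count(n: int, radix: int) -> int:
    """Return log base `radix` of n, requiring n to be an exact power."""
    stages = 0
    while n > 1:
        if n % radix:
            raise ValueError(f"{n} is not a power of {radix}")
        n //= radix
        stages += 1
    return stages


def butterfly_count(n: int, radix: int) -> int:
    """(N / radix) butterflies per stage times log_radix(N) stages."""
    return (n // radix) * stage_count(n, radix)


def realtime_units(n: int, radix: int) -> int:
    """Rate-1/radix butterfly units needed for a real time factor of 1."""
    return stage_count(n, radix)


for size in (256, 1024, 4096):
    print(
        size,
        butterfly_count(size, 2), realtime_units(size, 2),
        butterfly_count(size, 4), realtime_units(size, 4),
    )
```

For N = 1024 this gives 5120 radix-2 butterflies against 1280 radix-4 butterflies, and 10 rate-1/2 units against 5 rate-1/4 units.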


C. CYCLIC SPECTRUM ANALYZER DESIGN 
1. Input Specifications 

The input to the analyzer is a sequence of floating
point values that are obtained by sampling a real wideband
signal at approximately 50 MHz and then applying the Hilbert
transform to the digital signal. This implies a bandwidth of
at most 50 MHz using the complex envelope. Since the signals
of interest (SOI) for this analyzer are primarily wideband
signals, the frequency resolution is not required to be
extremely fine.

2. Digital Implementation
a. Introduction 

There are three algorithms to compute the cyclic
spectrum stated in Chapter 1: 1) Frequency Smoothing Method
(FSM), 2) Strip Spectral Correlation Analyzer (SSCA), 3)
FFT Accumulation Method (FAM). All of these algorithms
require a large number of arithmetic computations to be
implemented and a large amount of hardware to execute them in
near real time. The objective is to consider ways to
implement these algorithms in real time. [Ref. 1]

b. Frequency Smoothing Method (FSM) 

The FSM algorithm consists of two parts: an N
point spectral correlator and an M point summation unit. M is
the time-bandwidth product ($\Delta t \cdot \Delta f$). Figure 26 illustrates
the architecture for the FSM algorithm. The frequency values
f and $\alpha$ are denoted by:

$$f = \frac{2k + l}{2N},$$

$$\alpha = \frac{l}{N},$$

where N is the length of the input sequence, and k and l are
sequence indices. The complexity of this algorithm, described
in number of FFT butterflies, is:

$$C_{b2} = \frac{N}{2} \log_2 N$$

complex radix-2 butterflies and the complexity product is:

$$p_{b2} F_T = \log_2 N$$

for rate-1/2 complex radix-2 FFT butterflies. Implementing
the algorithm's FFTs with radix-4 butterflies, the complexity
product is:

$$p_{b4} F_T = \log_4 N.$$

The correlator portion has a requirement for separate
multipliers enumerated by the complexity product $p_{rm} \cdot F_T$ given
by:

[equation illegible in source]

rate-1 real multipliers. [Ref. 1]

Figure 26. Frequency Smoothing Method Architecture
c. Strip Spectral Correlation Analyzer 

The SSCA algorithm consists of four parts: 1) an
N' point FFT, 2) a down conversion multiplier, 3) a
correlation multiplier, and 4) an N point FFT, where $LP = \Delta t$,
$L = N'/4 = N/4M$, $\Delta f = 1/N' = M/N$, $\Delta\alpha = 1/\Delta t = 1/N$, and
$\Delta t \Delta f = N/N' = M$. Figure 27 [Ref. 1] illustrates the
architecture for the SSCA algorithm. [Ref. 1]

The complexity of this algorithm, described in
number of FFT butterflies, is:

$$C_{b2} = 2 \Delta t \log_2 \frac{1}{\Delta f} + \frac{\Delta t}{2 \Delta f} \log_2 \Delta t.$$

Then the complexity product is:

$$p_{b2} F_T = 4 \log_2 \frac{1}{\Delta f} + \frac{1}{\Delta f} \log_2 \Delta t$$

for the rate-1/2 complex butterflies. Using radix-4
butterflies, the complexity product is derived from the
complexity of the FFTs:

$$C_{b4} = \Delta t \log_4 \frac{1}{\Delta f} + \frac{\Delta t}{4 \Delta f} \log_4 \Delta t = N \log_4 \frac{N}{M} + \frac{N^2}{4M} \log_4 N.$$

Then the complexity product is:

$$p_{b4} F_T = \frac{4 C_{b4}}{N} = 4 \log_4 \frac{1}{\Delta f} + \frac{1}{\Delta f} \log_4 \Delta t$$

for rate-1/4 complex radix-4 butterflies. The remaining
number of multiplies is:

$$p_{rm} F_T = \ldots$$ [right-hand side illegible in source]

for the rate-1 multipliers. [Ref. 1]

Figure 27. Strip Spectral Correlation Analyzer
Architecture

d. FFT Accumulation Method (FAM) 

The FAM algorithm consists of four parts: 1) an N'
point FFT, 2) a down conversion, 3) a correlation multiplier,
and 4) a P point FFT, where $N = \Delta t = LP$, $1/N' = M/N = \Delta f$, and
$L = N'/4$. Figure 28 illustrates the architecture of the FAM
algorithm. [Ref. 1]

The complexity of this algorithm, described in
number of FFT butterflies, is:

$$C_{b2} = 2 \Delta t \log_2 \frac{1}{\Delta f} + \frac{4 \Delta t \Delta f}{2 \Delta f^2} \log_2 4 \Delta t \Delta f.$$

The complexity product is:

$$p_{b2} F_T = 4 \log_2 \frac{1}{\Delta f} + \frac{4}{\Delta f} \log_2 4 \Delta t \Delta f$$

for rate-1/2 complex butterflies. Using radix-4 butterflies,
the complexity product is derived from the complexity of the
FFTs:

$$C_{b4} = \Delta t \log_4 \frac{1}{\Delta f} + \frac{4 \Delta t \Delta f}{4 \Delta f^2} \log_4 4 \Delta t \Delta f = N \log_4 \frac{N}{M} + \frac{N^2}{M} \log_4 4M.$$

Then the complexity product is:

$$p_{b4} F_T = \frac{4 C_{b4}}{N} = 4 \log_4 \frac{1}{\Delta f} + \frac{4}{\Delta f} \log_4 4 \Delta t \Delta f$$

for rate-1/4 complex radix-4 butterflies. The remaining
number of multiplies is:

$$p_{rm} F_T = \frac{4N}{M} + 20 = \frac{4}{\Delta f} + 20$$

for the rate-1 real multipliers associated with the down
conversion, the correlator, and the windowing function. [Ref.
1]

Figure 28. FFT Accumulation Method Architecture


D. CONCLUSIONS 

Figure 29 [Ref. 1] is a log-log plot of the complexity
product versus time-bandwidth product ($\Delta t \Delta f$) for a given $\Delta f$ =
1/8. Although the FSM algorithm is the simplest conceptually,
it is much more complex computationally than the other two
algorithms. The SSCA algorithm has the smallest
complexity for large N. Figure 30 is a log-log plot of the
complexity product versus $\Delta t \Delta f$ with a given $\Delta f$ = 1/8 for the
FAM algorithm using the radix-2 and the radix-4 complex
butterflies. Chapter 4 discusses the actual VLSI design of a
rate-1/4 radix-4 complex FFT butterfly and the multiplier and
adder that are used to construct it.

Figure 29. Log Complexity Product vs. Log $\Delta t \Delta f$
(three realizations: SSCA, FAM, and FSM)

Figure 30. Log Complexity Product vs. Log $\Delta t \Delta f$
using Radix-2 and Radix-4 Butterflies


IV. DIGITAL DESIGNS


A. INTRODUCTION 
1. Specifications 

Specifications for the multiplier, adder, and FFT
butterfly are given in Appendix B. The fabline MHS CN10C is
a 1.0 micron feature size technology. To achieve the
operating frequency required, all designs are pipelined.

2. Pipelining

Pipelining is based on separating a logic circuit
(multiplier, adder, or FFT) into smaller and faster
subcircuits. These subcircuits or stages are separated by
storage registers. The storage registers synchronize and save
the output of one stage and provide that output as input into
the next stage. Figure 31 illustrates a pipelined logic
circuit. Although the delay from a given input to the correct
output is longer and takes multiple clock cycles to complete
one operation, this method provides a way to implement the
logic circuit with a high frequency clock. New inputs must be
provided into the pipelined circuit every clock cycle to keep
the pipeline full. Pipelining is ideal for signal processing
applications because the large amount of data to process will
always keep the pipeline full.

Figure 31. Pipelined Process

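The latency and throughput behavior of a pipeline can be illustrated with a small Python simulation. This is a behavioral sketch only: each stage is modeled as a function, each storage register as a list entry, and `None` marks an empty register (so `None` cannot be used as a data value). The function name `run_pipeline` and the three example stages are hypothetical.

```python
def run_pipeline(stages, inputs):
    """Clock a stream of inputs through a register-separated pipeline.

    On every clock each stage processes the value held in the register
    in front of it, so a new input is accepted every cycle while each
    result appears len(stages) cycles after its input.
    """
    registers = [None] * len(stages)     # register at the output of each stage
    outputs = []
    # Extra empty cycles at the end flush the values still in flight.
    stream = list(inputs) + [None] * len(stages)
    for value in stream:
        feeds = [value] + registers[:-1]  # what each stage sees this cycle
        registers = [
            None if feed is None else stage(feed)
            for stage, feed in zip(stages, feeds)
        ]
        if registers[-1] is not None:
            outputs.append(registers[-1])
    return outputs


three_stages = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
print(run_pipeline(three_stages, [1, 2, 3]))
```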

B. FLOATING POINT MULTIPLIER 
1. Introduction 

The floating point multiplier design required only
6 stages to implement a 45 MHz clock frequency. By far the
most limiting portion (the slowest stage) was the parallel
multiplier array. Stage 1 is comprised of the Genesil library
parallel multiplier, the conversion of the input exponents
from excess code to two's complement, and the XOR of the input
sign bits. Stage 2 sums the partial products from the
parallel multiplier cell and adds the two's complement
exponents together. Stage 3 performs the normalization.
Stage 4 performs the adding for the rounding function. Stage
5 performs the postnormalization required due to the addition
from rounding and the exponent adjusting due to normalization
and postnormalization. Stage 6 provides the setting or
clearing of all bits in the cases of exponent overflow or
underflow. The multiplier will be described in the following
paragraphs as illustrated in Figure 21 without regard to
pipelining stage boundaries.
2. Multiplier Block 

This block is the hardware that computes the product
of the input mantissas. Since the mantissas each have a word
length of 14 bits including the hidden bit, the product will
be 28 bits wide. In this case, the Genesil library parallel
multiplier provides the 14 least significant bits of the
product but only provides two partial products for the next 13
more significant bits. These partial products must be summed
together to get the 14 most significant bits of the product.
The conditional sum adder provides the required speed with an
acceptable amount of hardware used. The Genesil library
multiplier also provides pregenerated "sticky" bits required
for rounding. Sticky[1] is the OR of the least significant 12
product bits and sticky[0] is the OR of the least significant
13 product bits. The output of the Multiply block is the 16
most significant product bits and the "sticky" bits. The 12
least significant bits are adequately represented by the 2
most significant of these and by the "sticky" bits. Figure 32
is a block diagram illustrating the functions completed in the
Multiplier block.

Figure 32. Multiply Block
3. Exponent Add Block 

The function of this block is to add the two input 
exponents. This would be rather difficult to do with the 
exponents left in excess code, so they are converted to two's 
complement simply by inverting the most significant bit of 
each exponent. Before the conversion is done, each 
exponent is tested for all zeros, which indicates that the 
floating point number associated with the exponent is true 
zero. A flag is generated for this condition so that, after 
all the operations in the floating point multiply are complete, 
the output can be set to zero. The exponents are then added 
as two's complement integers using a 6-bit carry-ripple adder 
with additional logic to detect overflow and underflow. 
Overflow and underflow are indicated at the output of the 
floating point multiplier and are also used to set or 
clear the output in the clean-up stage. The product exponent 
is then converted back to excess code before being output from 
this block. Figure 33 is a block diagram illustrating the function 
of the Exponent Add block. 
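
A minimal software sketch of this block follows. It is not the 
Genesil netlist: the function names are hypothetical, the bias of 
32 is an assumption inferred from the 6-bit exponent and the 
number range in Appendix B, and overflow and underflow are 
modeled simply as the sum leaving the 6-bit two's complement 
range. 

```python
EXP_BITS = 6
BIAS = 1 << (EXP_BITS - 1)      # 32, assumed excess-code bias
MASK = (1 << EXP_BITS) - 1      # 0b111111


def excess_to_twos(e: int) -> int:
    """Invert the MSB of an excess-code exponent to get its two's complement pattern."""
    return (e ^ BIAS) & MASK


def twos_to_int(t: int) -> int:
    """Interpret a 6-bit pattern as a signed integer."""
    return t - (1 << EXP_BITS) if t & BIAS else t


def exponent_add(ea: int, eb: int):
    """Return (excess-code product exponent, overflow, underflow, zero_flag)."""
    zero_flag = (ea == 0) or (eb == 0)          # all-zero exponent marks true zero
    total = twos_to_int(excess_to_twos(ea)) + twos_to_int(excess_to_twos(eb))
    overflow = total > BIAS - 1                 # above +31
    underflow = total < -BIAS                   # below -32
    product = (total & MASK) ^ BIAS             # back to excess code
    return product, overflow, underflow, zero_flag
```

For example, stored exponents 33 and 34 (true exponents 1 and 2) 
give a stored product exponent of 35 (true exponent 3). 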
4. Normalization Block 
a. Introduction 

As stated in Chapter 2 the Normalization block can 
be broken down into 3 sub-blocks. Figure 34 is the block 
diagram describing the functions executed by the Normalization 
block. The Normalizer sub-block performs the initial 
normalizing of the mantissa product. The Rounder sub-block 
performs the rounding of the mantissa product to the correct 
number of significant bits. The Postnormalization sub-block 
performs the final normalization due to possible carry-out 
during the rounding process. 

b. Normalizer Sub-Block 

The Normalizer sub-block uses the 16-bit product 
output from the Multiplier to perform the initial 
normalization of the mantissa product. Since the number 
system only allows normalized numbers, the input mantissas will 
always be between 1 and 2 and the product mantissa will always 
be between 1 and 4. This implies that normalization is only 
required when the product mantissa is greater than or equal to 
2 (10 in binary). This occurs when the most significant bit of the 
product mantissa is a 1. If this is the case, the product 
mantissa is shifted 1 bit to the right and a 1 is sent to the 
Exponent Adjust block to be added to the product exponent. If 
the most significant bit of the mantissa is a 0, then no 
normalization is done. In either case, the most significant 
bit is dropped since it is the hidden bit of the product 
mantissa. All of these operations are completed with 
a 2-input multiplexer. The output of the Normalizer sub-block 
is the 14-bit product mantissa and the add bit for the 
Exponent Adjust block. 

Figure 33. Exponent Add Block 

Figure 34. Normalization Block 

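
The multiplexer selection can be sketched as below. The bit 
layout is an assumption: the 16-bit product is taken to hold two 
integer bits followed by 14 fraction bits, and the bit shifted 
out on normalization is assumed to be covered by the sticky bits. 
The function name is hypothetical. 

```python
def normalize_product(p16: int):
    """Normalize the 16 MSBs of a mantissa product in [1, 4).

    Returns (14-bit mantissa with the hidden bit dropped, exponent add bit).
    """
    msb = (p16 >> 15) & 1          # set when the product is >= 2
    if msb:
        shifted = p16 >> 1         # shift right one place; hidden bit moves to bit 14
    else:
        shifted = p16              # already normalized; hidden bit is bit 14
    mantissa14 = shifted & 0x3FFF  # keep bits 13..0, dropping the hidden bit
    return mantissa14, msb
```

A product of 3.0 (0xC000) becomes 1.5 with an add bit of 1, and 
a product of 1.5 (0x6000) passes through with an add bit of 0; 
both yield the same 14-bit mantissa 0x2000. 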
c. Rounder Sub-Block 

The Rounder performs the unbiased rounding of the 
product mantissa. The sticky bits and the two least 
significant bits of the product mantissa from the Normalizer 
sub-block are used through logic modules to determine when to 
add 1 to the 13 most significant bits of the product mantissa. 
The 13-bit product mantissa and any carry-out 
generated are output to the Postnormalizer sub-block to 
complete the required normalization. 

d. Postnormalizer Sub-Block 

The Postnormalizer sub-block performs the same 
function as the Normalizer sub-block with the carry-out from 
the Rounder sub-block as the decision maker for normalizing. 
The output of the Postnormalizer sub-block is the final 13-bit 
product mantissa and another add bit for the Exponent Adjust 
block. 
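
The rounding and postnormalization steps can be illustrated 
together. This sketch assumes that "unbiased rounding" means 
round to nearest with ties to even, that the 14-bit input is 13 
mantissa bits followed by one guard bit, and that a single sticky 
value has already been selected; none of these details are 
confirmed by the text, and the function name is hypothetical. 

```python
def round_and_postnormalize(mantissa14: int, sticky: int):
    """Round a 14-bit mantissa (13 bits + guard) to 13 bits.

    Returns (13-bit mantissa, postnormalization add bit).
    """
    kept = mantissa14 >> 1                 # the 13 most significant bits
    guard = mantissa14 & 1                 # first discarded bit
    lsb = kept & 1
    round_up = guard & (sticky | lsb)      # ties go to the even neighbor
    kept += round_up
    carry = kept >> 13                     # carry-out: 1.11...1 rounded up to 10.00...0
    kept &= 0x1FFF                         # after the right shift the fraction is all zeros
    return kept, carry
```

When the carry-out is 1 the fraction bits are already zero, so 
the postnormalizing right shift only needs to pass the add bit 
on to the Exponent Adjust block. 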
5. Exponent Adjust Block 

The Exponent Adjust block is merely a 6-bit carry-ripple 
adder that adds the add bits generated in the Normalization 
block to the excess code product exponent. Since these 
add bits are each only 1 bit wide, they can both be added using 
one input and the carry-in of the least significant bit of the 
adder. Overflow during this add is detected by a carry-out 
from the adder. Underflow is not possible because the 
function is always an addition, never a subtraction. The output of 
the Exponent Adjust block is the final product exponent and an 
overflow bit. 

6. Clean-up 

The Clean-up function simply clears or sets 
every bit of the product except the sign bit for certain 
special cases. If there was exponent overflow in either the 
Exponent Add or Exponent Adjust block, then all of the bits 
are set and the output overflow bit is set. If there is exponent 
underflow, then all of the bits are cleared and the underflow 
bit is set. When at least one of the floating 
point inputs is zero, all of the bits are also cleared. 


C. FLOATING POINT ADDER 
1. Introduction 
The Floating Point Adder is a much more complex design 
than the multiplier. The only similarities between the 
multiplier and the adder are the Postnormalizer, the general 
structure of the Normalization blocks, and the final clean-up. 
The adder design required 14 pipeline stages to achieve 
an operating frequency over 45 MHz. No single function or 
section of the adder dominated the delay. The adder 
will be described in the following paragraphs as illustrated 
in Figure 22 without regard to pipelining stage boundaries. 
2. Zero Test Block 
The Zero Test block tests each input exponent for true 
zero. If an exponent is true zero then the Zero Test block 
generates a 0 for the "hidden" bit of its mantissa to be 
passed to the Mantissa Select block. The output of the Zero 
Test block is the input exponents and the 2 "hidden" bits. 
3. Exponent Compare Block 
The Exponent Compare block determines which floating 
point input number has the smaller exponent. After the 
exponents are converted to two's complement, this is 
accomplished by subtracting one from the other and vice versa 
and checking the sign of one of the differences. Let A be one 
input and B the other, with associated exponents E_A and E_B. 
If E_A - E_B < 0 then E_A is the smaller exponent, so the Mantissa 
Select block is signaled to select the mantissa of A for right 
shifting and the difference is the number of bits to shift. 
E_B is selected as the sum exponent of the floating point 
number. The outputs of the Exponent Compare block are the sum 
exponent, the mantissa select bit, and the number of shifts. 
Figure 35 is a block diagram of the functions completed in the 
Exponent Compare block. 
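
The comparison can be sketched in a few lines. The exponents 
are taken here as already-converted signed integers, and the 
function and variable names are hypothetical. Both differences 
are formed, mirroring the "one from the other and vice versa" 
description, so that the shift count is always non-negative. 

```python
def exponent_compare(e_a: int, e_b: int):
    """Return (sum exponent, select_a, shift count).

    select_a is True when the mantissa of A must be shifted right.
    """
    diff_ab = e_a - e_b
    diff_ba = e_b - e_a
    if diff_ab < 0:                 # E_A is the smaller exponent
        return e_b, True, diff_ba   # shift A's mantissa, keep E_B
    return e_a, False, diff_ab      # shift B's mantissa (by 0 if equal), keep E_A
```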
4. Mantissa Select Block 
a. Introduction 

The Mantissa Select block performs the overall 
function of shifting the mantissa of the floating point number 
with the smaller exponent to the right by the correct number of 
bits to align it with the other mantissa. Figure 36 is a 
block diagram of the Mantissa Select block. 

b. Ones' Conversion Sub-Block 

Since the floating point numbers are in signed 
magnitude, the easiest and least hardware-intensive number 
system in which to compute the sum is ones' complement. The 
conversion requires only inverting all the bits of the 
mantissa if the number is negative. This sub-block also 
concatenates the "hidden" bit as determined in the Zero Test 
block. The outputs of the Ones' Conversion sub-block are 2 
15-bit mantissas. 

Figure 36. Mantissa Select Block 

c. Selector Sub-Block 

The Selector sub-block determines, using the 
mantissa select bit generated in the Exponent Compare block, 
which mantissa will be shifted and which will not. This is 
accomplished with two 2-input multiplexers. The outputs of 
the Selector sub-block are the two mantissas. 

d. Align Sub-Block 

The Align sub-block shifts the mantissa that was selected 
for alignment in the Selector sub-block. The number of 
bit positions to shift to the right was determined in the 
Exponent Compare block. This shifting function is done with 
a barrel shifter. A barrel shifter uses combinational logic 
rather than sequential circuitry to complete a shift. The output 
of the Align sub-block is the 29-bit shifted mantissa. 
5. Adder Block 

The Adder block performs the actual addition of the 
two mantissas. The integer adder used is the Conditional Sum 
adder. It provides the requisite speed without using a large 
amount of chip area. The most significant 15 bits of the 
shifted mantissa are added to the unshifted mantissa. Both 
mantissas are sign extended one bit to prevent carry-out due 
to overflow or underflow. Thus, the only reason for a carry- 
out would be to generate an "end-around" carry. The addition 
of the "end-around" carry is done with a 16-bit half-adder since 
the only inputs are the sum and the 1-bit "end-around" carry. 
The output of the adder block is the sum of the mantissas 
concatenated with the 15 least significant bits of the shifted 
mantissa. Figure 37 is a block diagram of the Adder block. 
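
The ones' complement addition with end-around carry can be 
illustrated as follows. This is a behavioral sketch, not the 
conditional sum adder itself: the 16-bit width follows the 
sign-extended 15-bit mantissas described above, while the 
function names and the use of plain integers as magnitudes are 
assumptions made for the example. 

```python
WIDTH = 16                      # 15-bit mantissa sign extended by one bit
MASK = (1 << WIDTH) - 1


def to_ones_complement(magnitude: int, negative: bool) -> int:
    """Invert every bit of the magnitude when the number is negative."""
    return (~magnitude & MASK) if negative else (magnitude & MASK)


def ones_complement_add(a: int, b: int) -> int:
    """Add two ones' complement words and fold the carry-out back in."""
    total = a + b
    carry = total >> WIDTH              # the "end-around" carry
    return (total + carry) & MASK       # second add is only sum + 1-bit carry


def from_ones_complement(x: int):
    """Return (magnitude, negative) for a ones' complement word."""
    negative = bool(x >> (WIDTH - 1))
    return ((~x & MASK) if negative else x), negative
```

For instance, adding +5 and -3 produces a carry-out that is 
added back in to give +2, while adding +3 and -5 produces no 
carry and yields the bit-inverted pattern of 2. 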
6. Normalization Block 
a. Introduction 

The Normalization block of the floating point adder 
is similar in structure to the one in the floating point 
multiplier. It can also be broken into the Normalizer, 
Rounder, and Postnormalizer sub-blocks. It differs only in 
the Normalizer sub-block, since the mantissa sum could be any 
number between 0 and 4. This means the leading nonzero bit 
could be in any bit position. The Normalizer must be able to 
detect it and shift accordingly. The Postnormalizer sub-block 
is exactly the same as the one in the floating point 
multiplier. 

b. Normalizer Sub-Block 

Before the mantissa sum can be normalized it must 
be converted from ones' complement to signed magnitude. Then 
the Normalizer sub-block uses a programmable priority encoder 
to sense the leading nonzero bit. This priority encoder can 
generate any bit pattern for a given position of the 
leading nonzero bit. The pattern in this case is the 
number of bit positions to shift the mantissa. The exponent 
must be adjusted also, so another encoder is used to generate 
the two's complement number to be added to the exponent in the 
Exponent Adjust block. The mantissa is shifted with a 
barrel shifter. The outputs of the Normalizer sub-block are 
the two's complement number to be added to the exponent, the 
29-bit normalized mantissa sum, and the sign bit. 

Figure 37. Adder Block 

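
A behavioral sketch of the leading-bit detection and shift is 
given below. The bit layout is an assumption: the 29-bit 
magnitude is taken to have two integer bits (since the sum lies 
between 0 and 4), so the hidden bit belongs at position 27. The 
function name is hypothetical, and Python's int.bit_length 
stands in for the priority encoder. 

```python
def normalize_sum(magnitude: int, width: int = 29):
    """Return (normalized magnitude, signed exponent adjustment)."""
    if magnitude == 0:
        return 0, 0                       # nothing to normalize
    lead = magnitude.bit_length() - 1     # priority encoder: leading nonzero bit
    target = width - 2                    # position of the hidden bit (weight 1)
    adjust = lead - target                # +1 when sum >= 2, negative when sum < 1
    if adjust > 0:
        magnitude >>= adjust              # barrel shift right
    else:
        magnitude <<= -adjust             # barrel shift left
    return magnitude, adjust
```

The second return value plays the role of the two's complement 
number sent to the Exponent Adjust block. 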
c. Rounder Sub-Block 

The Rounder sub-block must determine, with logic, from the 17 
least significant bits whether to round the 
mantissa sum up or to truncate. The outputs of the Rounder 
sub-block are the 13 rounded mantissa sum bits and the carry-out 
possibly generated by the addition due to round up. 


7. Exponent Adjust Block 
The Exponent Adjust block provides the adder to add 
the two's complement number and the postnormalization exponent 
adjustment bit generated in the Normalization block to the 
exponent selected for output. Since the exponent is only 6 
bits wide, the carry-ripple adder was used. The output of 
this block is the exponent of the final sum. 


D. RADIX-4 FFT BUTTERFLY 
1. Introduction 
The design is a rate-1/4 radix-4 complex floating 
point FFT butterfly. Rate-1/4 implies that one data point is 
input and one FFT point is output every clock cycle. The 
butterfly can be separated into four parts: an external 
twiddle factor multiplier; a shift register and latch; an 
internal twiddle factor multiplier; and a 4-input adder. The 
external twiddle factor at the input implies the DIT 
algorithm. Figure 38 is the block diagram of the rate-1/4 FFT 
butterfly. 
2. External Twiddle Factor Multiplier 
The twiddle factor multiplier is just a complex 
multiplier to facilitate computation of large FFTs with the 
radix-4 butterfly. The complex multiplier requires 4 
floating point multipliers and 2 floating point adders to 
implement. Figure 39 is the block diagram of the external 
twiddle factor multiplier. 

Figure 38. Rate-1/4 Radix-4 FFT Butterfly 
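
The count of 4 multipliers and 2 adders follows directly from 
the definition of a complex product. The sketch below uses 
ordinary Python arithmetic in place of the 20-bit floating point 
units; the function and argument names are hypothetical. 

```python
def complex_multiply(x_re, x_im, w_re, w_im):
    """(x_re + j x_im) * (w_re + j w_im) with 4 real multiplies and 2 real adds."""
    p1 = x_re * w_re        # multiplier 1
    p2 = x_im * w_im        # multiplier 2
    p3 = x_re * w_im        # multiplier 3
    p4 = x_im * w_re        # multiplier 4
    real = p1 - p2          # adder 1 (subtraction is an add with the sign flipped)
    imag = p3 + p4          # adder 2
    return real, imag
```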


3. Shift Register and Latch 

To compute a 4-point FFT at 1/4 rate, 4 data points 
must be clocked in and held for four more clock cycles. 
Figure 40 is the block diagram of the shift register and 
latch, and the internal twiddle factor multiplier 
(multiplexers). The shift register portion is just 3 
standard D-registers with their outputs connected to the 
inputs of the next register and to the inputs of a latch. A 
modulo-4 counter allows the 4 data points to be clocked in, and 
when the counter goes from 11 to 00 (binary) it generates a 
carry-out which is ANDed with the PHASE X clock; the output 
strobes the latch. The latch holds the four data points for 4 
clock cycles, at which time the latch is strobed again to hold 
4 more data points. 

Figure 39. External Twiddle Factor Multiplier 

4. Internal Twiddle Factor Multiplier 
The equation for the 4-input DFT computation was given 
in Chapter 3: 

\[ X(k) = x(0) + x(1)(-j)^{k} + x(2)(-1)^{k} + x(3)(j)^{k}, \quad k = 0, 1, 2, 3 \]

To generate the correct summands for a given k, multiplexers 
and logic will be used. The first summand (x(0)) is constant 
with respect to k, so it is left unchanged. The second and 
fourth summands (x(1) and x(3)) require 4-input multiplexers 
to complete the product. The third summand (x(2)) is just 
inverted for odd k. The different k's are generated by the 
counter in the shift register and latch portion. 

Figure 40. Shift Registers and Latch + Multiplexers 
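
The reason multiplexers suffice is that multiplying by a power 
of j only swaps and negates the real and imaginary parts. The 
sketch below illustrates this with Python complex numbers; the 
function names are hypothetical, and the pairing of the three 
additions is an assumed arrangement, not necessarily that of 
Figure 41. 

```python
def times_minus_j_power(x: complex, k: int) -> complex:
    """x * (-j)**k by selection only, as a 4-input multiplexer would do."""
    choices = (
        complex(x.real, x.imag),      # k = 0:  x
        complex(x.imag, -x.real),     # k = 1: -j * x
        complex(-x.real, -x.imag),    # k = 2: -x
        complex(-x.imag, x.real),     # k = 3:  j * x
    )
    return choices[k % 4]


def dft4_point(x, k: int) -> complex:
    """One output X(k) of the 4-point DFT of x[0..3]."""
    s0 = x[0]                                      # constant with respect to k
    s1 = times_minus_j_power(x[1], k)              # x(1) (-j)^k
    s2 = x[2] if k % 2 == 0 else -x[2]             # x(2) (-1)^k
    s3 = times_minus_j_power(x[3], (3 * k) % 4)    # x(3) (j)^k, since j = (-j)^3
    return (s0 + s1) + (s2 + s3)                   # three 2-operand complex adds
```

Stepping k through 0, 1, 2, 3, as the modulo-4 counter does, 
yields the four FFT points from one latched set of inputs. 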
5. 4-Input Complex Adder 
This adder is just 3 2-operand complex floating point 
adders arranged as in Figure 41. Each 2-operand complex adder 
requires 2 2-operand real floating point adders, which gives a 
total of 6 2-operand real floating point adders to implement 
the 4-input complex floating point adder. 

Figure 41. 4-Input Complex Adder 


V. CONCLUSIONS AND RECOMMENDATIONS 


A. CONCLUSIONS 

The rate-1/4 radix-4 complex floating point FFT butterfly 
was successfully designed and simulated in VLSI at a clock 
frequency of approximately 45 MHz. The design is large and 
will be expensive to fabricate. It utilizes 4 floating point 
multipliers, 8 floating point adders, 2 4-input multiplexers, 
3 data registers, 1 latch, and some assorted logic. The 
silicon area of the IC including pads is approximately 200,000 
mils². Appendix B describes the IC in detail. 

Logic-Compiler made this design feasible because of its 
ability to optimize the design for area and performance. If 
Logic-Compiler were not available and the author had to rely 
on the Genesil Standard layout compiler, the adder and 
multiplier would each be about 40,000 mils² in area and have 
an operating speed of less than 40 MHz. A complex multiplier 
implemented on one IC chip would not have been possible, let 
alone a radix-4 FFT butterfly. Logic-Compiler has made the 
Genesil Standard Compiler obsolete except for IC chip 
floorplanning, which requires user floorplanning for pad 
placement. 

B. RECOMMENDATIONS 

The author makes the following recommendations: 

1. Investigate further the commercially available IC chips 
for FFT computations and control path design. 

2. Purchase the 1.0 micron radiation hardened fabrication line 
library for Genesil. 

3. Investigate fabrication costs for the FFT butterfly IC 
chips. 

4. Begin design of chip sets for large FFTs and ultimately 
the cyclic spectrum analyzer. 

5. Design a bona fide 4-input adder to replace the one built 
from 3 2-input adders, which will reduce the silicon area of 
the FFT butterfly design. 



APPENDIX A. AUTOLOGIC 


A. GENESIL SILICON COMPILER 
1. Introduction 

The Genesil Designer is an integrated set of automated 
ASIC design tools that contain the IC design expertise 
necessary to transform a functional specification into a data 
base from which an IC can be produced. Genesil provides 1) 
High-level design entry allowing system designers to create 
dense physical designs for integrated circuits; 2) Rapid 
feedback on key performance metrics during exploratory and 
detailed design stages; 3) Verification tools for simulation, 
timing analysis, and layout; 4) Compiler libraries that can be 
expanded with user developed compilers; 5) The ability to 
import layouts designed with other CAD tools; 6) Multiple 
fabrication options for process-independence of designs. 
Figure 42 [Ref. 10:p. 1-1] illustrates the Genesil 
environment. Input to Genesil is done with high-level 
functional descriptions, using a combination of forms-based 
entry and schematic capture. Output consists of layout files 
that are sent to an IC manufacturer for fabrication. 



Figure 42. Genesil Environment (blocks shown: Design Entry with 
Forms and Schematic Capture; Automatic Netlist Processing; 
Logic Synthesis; Floorplanning; Automatic Test Vector 
Generation; UDV Tools) 


2. Design Process 
a. Introduction 
There are three phases in the Genesil design 
process: 1) Design Entry; 2) Design Verification; 3) Design 
Manufacture. Prior to beginning the design process, the 
designer must determine the required logic functionality, the 
physical requirements for timing, size, and power consumption, 
and the testability requirements. 
b. Design Entry 
(1) Introduction. Design entry in Genesil is 
called forms-based entry. Basically, the design parameters 
are input via menus and forms that Genesil provides. The 
design process is initialized by selecting the type of module: 
Parallel Datapath, Block, Random Logic, or General. Random Logic 
modules include functions that range from logic gates to 
multiplexers and full adder cells. Blocks include RAM, ROM, 
PLAs, pads, and parallel multiplier cells. Parallel Datapath 
modules are optimized for parallel data and control 
operations, including: arithmetic and logic functions; bus- 
structured interface operations; and parallel control 
operations. General modules contain a number of sub-modules 
that are of any module type described above. [Ref. 11:p. 1-1] 
There are four basic tasks to be accomplished in design entry: 
header definition, specification definition, netlisting, and 
floorplanning. 


(2) Header Definition. The header form lets 
the Genesil user specify the fabrication line, function 
type, IC package type, and compiler type. The fabrication 
line can be selected from the list provided in the header 
form. Header forms for Block modules provide the different 
functions that can be selected. If the design is for a 
complete chip, the header form provides the different packages in 
the Genesil library for selection. Compiler type can be either 
Standard Genesil or Logic-Compiler. 

(3) Specification Definition. Definition of the 
logical function of the module is completed during 
specification definition. The specification menu and form 
vary according to the module type selected. Generally, this 
form allows the user to define the module specifics. In 
parallel datapath modules, bus widths and type of bus drivers 
are specified. In a random logic module, the actual function 
is defined, such as adder, multiplexer, or logic gates. 
Blocks are specified with width and depth of RAM, ROM, PLAs, 
or parallel multipliers. All modules provide the ability to 
specify names of nets, the ability to specify the clocks, and 
the ability to create sub-modules. 

(4) Netlisting. Netlisting is performed in a 
general module to specify the interconnections between sub- 
modules and to specify which nets will be external and 
internal to the general module. The netlisting can be done 
implicitly during the specification definition process by 
giving interconnecting nets the same name, but the nets will 
still have to be designated as internal or external during the 
netlist process. 

(5) Floorplanning. Floorplanning is performed in 
a general module to physically arrange the sub-modules in the 
module and to specify the locations of external nets on the 
edge of the module. Genesil also provides a program called 
Flair that gives more control over sub-module placement and 
wire routing. If the block compiler is Logic-Compiler then no 
floorplanning is required because it is done automatically 
within the block compiler. 

c. Design Verification 

(1) Introduction. Design verification is the 
process of verifying a circuit for correct functionality and 
performance. Genesil includes tools to verify the logical 
functionality, the physical performance, and the layout of the 
design before fabrication. The ability to verify first the 
logic and then the performance makes each process faster. 
Genesil also checks for electrical design rule violations, net 
inconsistencies, and illegal bus merging during block 
compilation. [Ref. 10:p. 1-5] 

(2) Simulation. The Genesil simulator provides 
functional models and a demand-driven engine for rapid 
feedback, a test-vector assembler, and a Genie-based interface 
which offers programmatic and command-line control, 
modeling, and debug capability. [Ref. 10:p. 1-5] High-level 
functional models provide rapid feedback on logical 
verification. Switch-level models (GSL) provide final circuit 
verification. Test vectors can be generated with the test 
vector assembler (MASM) or with Genie check functions. [Ref. 
12] 

(3) Timing Analysis. The Genesil timing analyzer 
predicts performance based on timing models of the design, 
which were generated during block compilation. Reports that 
are generated include maximum clock frequency, minimum clock 
phases, setup and hold times, and output delays. [Ref. 13:p. 
1-1] 
d. Design Manufacture 
Genesil generates tooling tapes in industry 
standard CIF or GDSII tape formats for photomask production, 
or in formats customized for an in-house fabrication process. 
[Ref. 10:p. 1-6] 


B. LOGIC-COMPILER 
1. Introduction 
Logic-Compiler is an alternative compiler that 
provides optimal designs for Genesil modules. Figure 43 [Ref. 
14:p. 1-1] displays the Logic-Compiler design process. 
Essentially, Logic-Compiler takes Genesil modules and makes 
them compatible with Autologic. Autologic then produces 
optimized designs. These optimized netlists are then used to 
produce Genesil layout and timing models. 
2. Optimization Process 
Logic-Compiler enables the user to control design 
performance, floorplanning, aspect ratio, pinout, and 
feedthroughs for faster, easier-to-make tradeoffs between 
performance and density. [Ref. 14:p. 1-1] Figure 44 
illustrates the Logic-Compiler optimization process. Inputs 
to Logic-Compiler are the simulation model of a Genesil 
module and the Logic-Compiler control editor settings. The 
simulation model from Genesil is a netlist of simulation 
primitives. The logic optimization for area minimization is 
done in Autologic. Autologic produces an optimized netlist of 
the simulation model. The Lpar Netlist block places logic cells 
and puts wire routing between rows. The L Compiler block 
generates Genesil layout and timing models from the optimized 
netlists. [Ref. 14:p. 1-3] 

Figure 43. Logic-Compiler Design Process (steps shown: define a 
block or module using GENESIL design entry; in the Header Form, 
choose LogicCompiler; use the LogicCompiler Form to assign 
compilation parameters; select the COMPILE LAYOUT command to 
compile the block or module; measure performance with the 
GENESIL Timing Analyzer) 

3. Using Logic-Compiler 
a. Introduction 

The Logic-Compiler option is used in place of the 
Genesil block compiler. Logic-Compiler uses a Genesil defined 
module as input to produce a single block of layout by 
composing its source netlist from the block simulation 
netlists and object netlist definitions in the module. The 
Logic-Compiler compiled module does not require floorplanning 
because all cell placement is done automatically. In a 
general module, where the module is defined by a group of sub- 
modules, the lower level compile options are ignored. Logic- 
Compiler creates a layout for the complete module, regardless 
of whether the sub-modules have the Standard compiler or the 
Logic-Compiler option. 

Placement in the floorplanning function is not 
required to define a module that is compiled with Logic- 
Compiler, but the pinout portion still must be completed. This 
is done in the Logic-Compiler control form and menu along with 
defining the compile parameters for the module. 

b. Logic-Compiler Control Editor 

The Logic-Compiler control editor allows the 
designer to choose the level of CPU effort for area and 
performance optimization. Parameters that can be specified 
are number of logic rows, cellset, and the level of CPU 
effort. The number of logic rows can either be specified by 
the FORCE option or automatically chosen with AUTO. There are 
three cellsets: 1) IOTA1 1 is tailored to 1.5 to 3.0 micron 
processes; 2) IOTA1 2 is also tailored to 1.5 to 3.0 micron 
processes but is larger and faster than IOTA1 1; 3) IOTA2 is 
tailored for 1.0 micron processes. The level of CPU effort 
can be set to low, med-low, medium, med-high, high, or 
maximum for both area and performance optimization. 

c. Customizing the Optimization 

In the Logic-Compiler menu, the optimization can be 
further customized by: 1) Defining the clocking regimes with 
the EDIT REGIMES command; 2) Specifying the clock edges with 
the SPECIFY CLOCKS command; 3) Defining timing constraints 
with the EDIT CONSTRAINTS command; 4) Specifying locations of 
external connectors with the SELECT CONNS command; 5) Specifying 
the number and location of router feedthroughs with the 
DISPLAY FEEDTHRUS command. 


C. AUTOLOGIC 
1. Introduction 
a. Components 

Autologic performs synthesis optimization on an 
input netlist to produce a netlist optimized for area and 
performance. Figure 45 [Ref. 15] illustrates 
Autologic's components. Files are designated with dashed 
lines and programs are designated with solid lines. 
Essentially, Autologic is a netlist processor and a logic 
synthesis engine. The netlist processor is a general purpose 
netlist manipulation tool that can read a large variety of 
netlists, including Genesil netlists, manipulate netlists, and 
write out netlists in any format. The synthesis engine reads 
the netlist and optimizes the design, based on a target 
technology database and user-supplied controls and constraints 
(via Logic-Compiler), using its built-in algorithms for 
optimizing, reading, writing, and scanning netlists and 
performing timing analysis. [Ref. 15:p. 1-2 - 1-3] 

Figure 45. AutoLogic Components 

b. Optimization Flow 
The optimization flow in AutoLogic consists of: 
1) Mapping the input netlist into target primitives; 2) 
Optimizing for area; 3) Running timing analysis; 4) Optimizing 
for performance and running timing analysis again until the 
performance constraints are met. Optimizing for area, which 
is measured in number of logic gates, is done in the synthesis 
engine. The process is called logic reduction. Timing 
analysis is done to determine if the constraints generated by 
the user via Logic-Compiler are met. If they are not met, 
then optimization for performance is done. Then the timing 
analysis is done again. This will continue until the timing 
constraints are met. Figure 46 [Ref. 15:p. 2-3] illustrates 
the optimization flow in AutoLogic. 
2. Optimization Algorithms 
a. Peepholes 

AutoLogic performs most optimization by applying 
pattern rules to selected subcircuits (peepholes) of the 
design. A peephole consists of a set of n source signals, 
where n is the input width, and all gates and nets whose 
function is related only to the source signals. Figure 47 
[Ref. 15:p. 5-14] illustrates a peephole of input width two. 
AutoLogic then calculates the truth table for each net in the 
peephole and attempts to replace it with a more efficient 
circuit from its database. [Ref. 15:pp. 5-13 - 5-14] 

b. Signature Synthesis Optimization 

Signature synthesis only modifies combinational 
logic; no sequential cells are touched. This algorithm is 
characterized as greedy because, for every peephole it 
evaluates, it substitutes the circuit with the lowest cost 
substitution it can find. [Ref. 15] 

Figure 47. Peepholes 


3. Time and Area Tradeoffs 
AutoLogic optimizes for performance by first 
identifying the critical paths and then balancing a gain in 
timing against an increase in area. The tradeoff between 
timing and area is done by minimizing the cost of every 
subcircuit. If a substitution reduces cost, it is made; if it 
increases cost, it is not made. [Ref. 15:p. 5-17] Cost is 
determined by the following equation: 

\[ \text{Subs Cost} = \text{New Cell Cost} - \text{Old Cell Cost} + \text{Timing Cost} + \text{Maxcap Cost} \]

Cell cost is the sum of the cost properties of all the cells 
in the subcircuit; timing cost is the cost or benefit associated 
with changing the timing of the circuit; maxcap cost is a cost 
penalty added if an output drives more than the maximum load. 
[Ref. 15] 
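
The accept-or-reject rule can be illustrated with a small 
sketch. It assumes the substitution cost is the new cell cost 
minus the old cell cost, plus the timing and maxcap costs; the 
function names, the tuple layout of the candidates, and the 
numeric values are all hypothetical and do not come from 
AutoLogic. 

```python
def substitution_cost(new_cell, old_cell, timing, maxcap):
    """Net cost of replacing a subcircuit; negative means the swap pays off."""
    return (new_cell - old_cell) + timing + maxcap


def pick_substitution(old_cell, candidates):
    """Greedy choice among (name, new_cell, timing, maxcap) candidates.

    Returns (name, cost) for the lowest-cost candidate that reduces cost,
    or None when every candidate would increase cost.
    """
    best = None
    for name, new_cell, timing, maxcap in candidates:
        cost = substitution_cost(new_cell, old_cell, timing, maxcap)
        if cost < 0 and (best is None or cost < best[1]):
            best = (name, cost)
    return best
```

A candidate with a larger cell cost can still be chosen if its 
timing benefit (a negative timing cost) outweighs the added area. 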


D. DESIGN COMPARISONS 

Table II compares area and performance of various design 
modules using the Genesil standard block compiler and 
Logic-Compiler (AutoLogic). For every module, the Logic-Compiler 
version is faster. The table also illustrates that dramatic 
improvement is possible for designs comprised of random logic, 
as in the case of the 16-bit conditional sum adder. Figure 48 
and Figure 49 are the layouts of the 4-bit block carry 
lookahead unit described in Chapter 2 using the Genesil 
Standard compiler and Logic-Compiler (AutoLogic), 
respectively. 

Table II. Design Comparisons 

Figure 48. 4-Bit Block Carry Lookahead Unit from 
the Genesil Block Compiler 


APPENDIX B. DESIGN SPECIFICATIONS 


FLOATING POINT NUMBER SYSTEMS 

1. 20 bit word size (1 sign, 6 exponent, and 13 
mantissa). 

2. Normalized numbers only - IEEE infinity, denormalized 
numbers, and NaNs are not recognized. 

3. True zero is recognized by all zeros in the exponent. 

4. Mantissa is in signed magnitude. 

5. Exponent is in excess 2^5 code. 

6. Smallest magnitude number: 1.0000000000000 (binary) x 2^-31 
= 4.65661287308 x 10^-10. 

7. Largest magnitude number: 1.1111111111111 (binary) x 2^31 
= 4.2947051520 x 10^9. 
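
The word format can be checked with a short decoder. The field 
order (sign in the top bit, then the 6-bit exponent, then the 
13-bit mantissa) and the bias of 32 are assumptions consistent 
with the smallest and largest magnitudes listed above; the 
function name is hypothetical. 

```python
def decode(word: int) -> float:
    """Decode a 20-bit word: 1 sign bit, 6 exponent bits, 13 mantissa bits."""
    sign = (word >> 19) & 0x1
    exponent = (word >> 13) & 0x3F
    fraction = word & 0x1FFF
    if exponent == 0:
        return 0.0                                # all-zero exponent is true zero
    value = (1 + fraction / 2**13) * 2.0 ** (exponent - 32)   # hidden bit restored
    return -value if sign else value


smallest = decode(1 << 13)                  # exponent field 1, mantissa 0
largest = decode((63 << 13) | 0x1FFF)       # exponent field 63, mantissa all ones
```

Under these assumptions the two extremes evaluate to 2^-31 and 
2^32 - 2^18 = 4294705152, matching the magnitudes listed above. 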


ADDER AND MULTIPLIER 

1. Fabrication Line: MHS CN10C. 

2. Exponent overflow forces largest output (sum or 
product) and is indicated with an exponent overflow 
bit. 

3. Exponent underflow forces the output (sum or 
product) to zero and is indicated with an exponent underflow 
bit. 

4. Pipeline stages: Multiplier: 6; Adder: 14. 

5. Maximum clock speed: 45 MHz. 

6. Approximate area (each): 10,000 mils². 


RADIX-4 FFT BUTTERFLY 

1. Initialization Procedure: 
a) Set LOAD COUNT bit to 1. 
b) Clock the circuit. 
c) Set LOAD COUNT bit to 0 and ENABLE COUNT bit to 1. 
d) Circuit is ready to clock in input points. 

2. Approximate Area: 200,000 mils². 

3. Overflow and underflow are indicated if any multiplier 
or adder has exponent underflow or overflow in any 
stage. 

4. 48 pipeline stages. 

5. Maximum clock speed: 45 MHz. 

6. Complex Word Format: First 20 bits - Real; 
Second 20 bits - Imaginary 